Files
hermes-agent/docs/plans/2026-03-16-pricing-accuracy-architecture-design.md
Teknium d417ba2a48 feat: add route-aware pricing estimates (#1695)
Salvaged from PR #1563 by @kshitijk4poor. Cherry-picked with authorship preserved.

- Route-aware pricing architecture replacing static MODEL_PRICING + heuristics
- Canonical usage normalization (Anthropic/OpenAI/Codex API shapes)
- Cache-aware billing (separate cache_read/cache_write rates)
- Cost status tracking (estimated/included/unknown/actual)
- OpenRouter live pricing via models API
- Schema migration v4→v5 with billing metadata columns
- Removed speculative forward-looking entries
- Removed cost display from CLI status bar
- Threaded OpenRouter metadata pre-warm

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
2026-03-17 03:44:44 -07:00

14 KiB

Pricing Accuracy Architecture

Date: 2026-03-16

Goal

Hermes should only show dollar costs when they are backed by an official source for the user's actual billing path.

This design replaces the current static, heuristic pricing flow in:

  • run_agent.py
  • agent/usage_pricing.py
  • agent/insights.py
  • cli.py

with a provider-aware pricing system that:

  • handles cache billing correctly
  • distinguishes actual vs estimated vs included vs unknown
  • reconciles post-hoc costs when providers expose authoritative billing data
  • supports direct providers, OpenRouter, subscriptions, enterprise pricing, and custom endpoints

Problems In The Current Design

Current Hermes behavior has four structural issues:

  1. It stores only prompt_tokens and completion_tokens, which is insufficient for providers that bill cache reads and cache writes separately.
  2. It uses a static model price table and fuzzy heuristics, which can drift from current official pricing.
  3. It assumes public API list pricing matches the user's real billing path.
  4. It has no distinction between live estimates and reconciled billed cost.

Design Principles

  1. Normalize usage before pricing.
  2. Never fold cached tokens into plain input cost.
  3. Track certainty explicitly.
  4. Treat the billing path as part of the model identity.
  5. Prefer official machine-readable sources over scraped docs.
  6. Use post-hoc provider cost APIs when available.
  7. Show n/a rather than inventing precision.

High-Level Architecture

The new system has four layers:

  1. usage_normalization Converts raw provider usage into a canonical usage record.
  2. pricing_source_resolution Determines the billing path, source of truth, and applicable pricing source.
  3. cost_estimation_and_reconciliation Produces an immediate estimate when possible, then replaces or annotates it with actual billed cost later.
  4. presentation /usage, /insights, and the status bar display cost with certainty metadata.

Canonical Usage Record

Add a canonical usage model that every provider path maps into before any pricing math happens.

Suggested structure:

@dataclass
class CanonicalUsage:
    provider: str
    billing_provider: str
    model: str
    billing_route: str

    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    reasoning_tokens: int = 0
    request_count: int = 1

    raw_usage: dict[str, Any] | None = None
    raw_usage_fields: dict[str, str] | None = None
    computed_fields: set[str] | None = None

    provider_request_id: str | None = None
    provider_generation_id: str | None = None
    provider_response_id: str | None = None

Rules:

  • input_tokens means non-cached input only.
  • cache_read_tokens and cache_write_tokens are never merged into input_tokens.
  • output_tokens excludes cache metrics.
  • reasoning_tokens is telemetry unless a provider officially bills it separately.

This is the same normalization pattern used by opencode, extended with provenance and reconciliation ids.

Provider Normalization Rules

OpenAI Direct

Source usage fields:

  • prompt_tokens
  • completion_tokens
  • prompt_tokens_details.cached_tokens

Normalization:

  • cache_read_tokens = cached_tokens
  • input_tokens = prompt_tokens - cached_tokens
  • cache_write_tokens = 0 unless OpenAI exposes it in the relevant route
  • output_tokens = completion_tokens

Anthropic Direct

Source usage fields:

  • input_tokens
  • output_tokens
  • cache_read_input_tokens
  • cache_creation_input_tokens

Normalization:

  • input_tokens = input_tokens
  • output_tokens = output_tokens
  • cache_read_tokens = cache_read_input_tokens
  • cache_write_tokens = cache_creation_input_tokens

OpenRouter

Estimate-time usage normalization should use the response usage payload with the same rules as the underlying provider when possible.

Reconciliation-time records should also store:

  • OpenRouter generation id
  • native token fields when available
  • total_cost
  • cache_discount
  • upstream_inference_cost
  • is_byok

Gemini / Vertex

Use official Gemini or Vertex usage fields where available.

If cached content tokens are exposed:

  • map them to cache_read_tokens

If a route exposes no cache creation metric:

  • store cache_write_tokens = 0
  • preserve the raw usage payload for later extension

DeepSeek And Other Direct Providers

Normalize only the fields that are officially exposed.

If a provider does not expose cache buckets:

  • do not infer them unless the provider explicitly documents how to derive them

Subscription / Included-Cost Routes

These still use the canonical usage model.

Tokens are tracked normally. Cost depends on billing mode, not on whether usage exists.

Billing Route Model

Hermes must stop keying pricing solely by model.

Introduce a billing route descriptor:

@dataclass
class BillingRoute:
    provider: str
    base_url: str | None
    model: str
    billing_mode: str
    organization_hint: str | None = None

billing_mode values:

  • official_cost_api
  • official_generation_api
  • official_models_api
  • official_docs_snapshot
  • subscription_included
  • user_override
  • custom_contract
  • unknown

Examples:

  • OpenAI direct API with Costs API access: official_cost_api
  • Anthropic direct API with Usage & Cost API access: official_cost_api
  • OpenRouter request before reconciliation: official_models_api
  • OpenRouter request after generation lookup: official_generation_api
  • GitHub Copilot style subscription route: subscription_included
  • local OpenAI-compatible server: unknown
  • enterprise contract with configured rates: custom_contract

Cost Status Model

Every displayed cost should have:

@dataclass
class CostResult:
    amount_usd: Decimal | None
    status: Literal["actual", "estimated", "included", "unknown"]
    source: Literal[
        "provider_cost_api",
        "provider_generation_api",
        "provider_models_api",
        "official_docs_snapshot",
        "user_override",
        "custom_contract",
        "none",
    ]
    label: str
    fetched_at: datetime | None
    pricing_version: str | None
    notes: list[str]

Presentation rules:

  • actual: show dollar amount as final
  • estimated: show dollar amount with estimate labeling
  • included: show included or $0.00 (included) depending on UX choice
  • unknown: show n/a

Official Source Hierarchy

Resolve cost using this order:

  1. Request-level or account-level official billed cost
  2. Official machine-readable model pricing
  3. Official docs snapshot
  4. User override or custom contract
  5. Unknown

The system must never skip to a lower level if a higher-confidence source exists for the current billing route.

Provider-Specific Truth Rules

OpenAI Direct

Preferred truth:

  1. Costs API for reconciled spend
  2. Official pricing page for live estimate

Anthropic Direct

Preferred truth:

  1. Usage & Cost API for reconciled spend
  2. Official pricing docs for live estimate

OpenRouter

Preferred truth:

  1. GET /api/v1/generation for reconciled total_cost
  2. GET /api/v1/models pricing for live estimate

Do not use underlying provider public pricing as the source of truth for OpenRouter billing.

Gemini / Vertex

Preferred truth:

  1. official billing export or billing API for reconciled spend when available for the route
  2. official pricing docs for estimate

DeepSeek

Preferred truth:

  1. official machine-readable cost source if available in the future
  2. official pricing docs snapshot today

Subscription-Included Routes

Preferred truth:

  1. explicit route config marking the model as included in subscription

These should display included, not an API list-price estimate.

Custom Endpoint / Local Model

Preferred truth:

  1. user override
  2. custom contract config
  3. unknown

These should default to unknown.

Pricing Catalog

Replace the current MODEL_PRICING dict with a richer pricing catalog.

Suggested record:

@dataclass
class PricingEntry:
    provider: str
    route_pattern: str
    model_pattern: str

    input_cost_per_million: Decimal | None = None
    output_cost_per_million: Decimal | None = None
    cache_read_cost_per_million: Decimal | None = None
    cache_write_cost_per_million: Decimal | None = None
    request_cost: Decimal | None = None
    image_cost: Decimal | None = None

    source: str = "official_docs_snapshot"
    source_url: str | None = None
    fetched_at: datetime | None = None
    pricing_version: str | None = None

The catalog should be route-aware:

  • openai:gpt-5
  • anthropic:claude-opus-4-6
  • openrouter:anthropic/claude-opus-4.6
  • copilot:gpt-4o

This avoids conflating direct-provider billing with aggregator billing.

Pricing Sync Architecture

Introduce a pricing sync subsystem instead of manually maintaining a single hardcoded table.

Suggested modules:

  • agent/pricing/catalog.py
  • agent/pricing/sources.py
  • agent/pricing/sync.py
  • agent/pricing/reconcile.py
  • agent/pricing/types.py

Sync Sources

  • OpenRouter models API
  • official provider docs snapshots where no API exists
  • user overrides from config

Sync Output

Cache pricing entries locally with:

  • source URL
  • fetch timestamp
  • version/hash
  • confidence/source type

Sync Frequency

  • startup warm cache
  • background refresh every 6 to 24 hours depending on source
  • manual hermes pricing sync

Reconciliation Architecture

Live requests may produce only an estimate initially. Hermes should reconcile them later when a provider exposes actual billed cost.

Suggested flow:

  1. Agent call completes.
  2. Hermes stores canonical usage plus reconciliation ids.
  3. Hermes computes an immediate estimate if a pricing source exists.
  4. A reconciliation worker fetches actual cost when supported.
  5. Session and message records are updated with actual cost.

This can run:

  • inline for cheap lookups
  • asynchronously for delayed provider accounting

Persistence Changes

Session storage should stop storing only aggregate prompt/completion totals.

Add fields for both usage and cost certainty:

  • input_tokens
  • output_tokens
  • cache_read_tokens
  • cache_write_tokens
  • reasoning_tokens
  • estimated_cost_usd
  • actual_cost_usd
  • cost_status
  • cost_source
  • pricing_version
  • billing_provider
  • billing_mode

If schema expansion is too large for one PR, add a new pricing events table:

session_cost_events
  id
  session_id
  request_id
  provider
  model
  billing_mode
  input_tokens
  output_tokens
  cache_read_tokens
  cache_write_tokens
  estimated_cost_usd
  actual_cost_usd
  cost_status
  cost_source
  pricing_version
  created_at
  updated_at

Hermes Touchpoints

run_agent.py

Current responsibility:

  • parse raw provider usage
  • update session token counters

New responsibility:

  • build CanonicalUsage
  • update canonical counters
  • store reconciliation ids
  • emit usage event to pricing subsystem

agent/usage_pricing.py

Current responsibility:

  • static lookup table
  • direct cost arithmetic

New responsibility:

  • move or replace with pricing catalog facade
  • no fuzzy model-family heuristics
  • no direct pricing without billing-route context

cli.py

Current responsibility:

  • compute session cost directly from prompt/completion totals

New responsibility:

  • display CostResult
  • show status badges:
    • actual
    • estimated
    • included
    • n/a

agent/insights.py

Current responsibility:

  • recompute historical estimates from static pricing

New responsibility:

  • aggregate stored pricing events
  • prefer actual cost over estimate
  • surface estimates only when reconciliation is unavailable

UX Rules

Status Bar

Show one of:

  • $1.42
  • ~$1.42
  • included
  • cost n/a

Where:

  • $1.42 means actual
  • ~$1.42 means estimated
  • included means subscription-backed or explicitly zero-cost route
  • cost n/a means unknown

/usage

Show:

  • token buckets
  • estimated cost
  • actual cost if available
  • cost status
  • pricing source

/insights

Aggregate:

  • actual cost totals
  • estimated-only totals
  • unknown-cost sessions count
  • included-cost sessions count

Config And Overrides

Add user-configurable pricing overrides in config:

pricing:
  mode: hybrid
  sync_on_startup: true
  sync_interval_hours: 12
  overrides:
    - provider: openrouter
      model: anthropic/claude-opus-4.6
      billing_mode: custom_contract
      input_cost_per_million: 4.25
      output_cost_per_million: 22.0
      cache_read_cost_per_million: 0.5
      cache_write_cost_per_million: 6.0
  included_routes:
    - provider: copilot
      model: "*"
    - provider: codex-subscription
      model: "*"

Overrides must win over catalog defaults for the matching billing route.

Rollout Plan

Phase 1

  • add canonical usage model
  • split cache token buckets in run_agent.py
  • stop pricing cache-inflated prompt totals
  • preserve current UI with improved backend math

Phase 2

  • add route-aware pricing catalog
  • integrate OpenRouter models API sync
  • add estimated vs included vs unknown

Phase 3

  • add reconciliation for OpenRouter generation cost
  • add actual cost persistence
  • update /insights to prefer actual cost

Phase 4

  • add direct OpenAI and Anthropic reconciliation paths
  • add user overrides and contract pricing
  • add pricing sync CLI command

Testing Strategy

Add tests for:

  • OpenAI cached token subtraction
  • Anthropic cache read/write separation
  • OpenRouter estimated vs actual reconciliation
  • subscription-backed models showing included
  • custom endpoints showing n/a
  • override precedence
  • stale catalog fallback behavior

Current tests that assume heuristic pricing should be replaced with route-aware expectations.

Non-Goals

  • exact enterprise billing reconstruction without an official source or user override
  • backfilling perfect historical cost for old sessions that lack cache bucket data
  • scraping arbitrary provider web pages at request time

Recommendation

Do not expand the existing MODEL_PRICING dict.

That path cannot satisfy the product requirement. Hermes should instead migrate to:

  • canonical usage normalization
  • route-aware pricing sources
  • estimate-then-reconcile cost lifecycle
  • explicit certainty states in the UI

This is the minimum architecture that makes the statement "Hermes pricing is backed by official sources where possible, and otherwise clearly labeled" defensible.