Files

Teknium d417ba2a48 feat: add route-aware pricing estimates (#1695 )

Salvaged from PR #1563 by @kshitijk4poor. Cherry-picked with authorship preserved.

- Route-aware pricing architecture replacing static MODEL_PRICING + heuristics
- Canonical usage normalization (Anthropic/OpenAI/Codex API shapes)
- Cache-aware billing (separate cache_read/cache_write rates)
- Cost status tracking (estimated/included/unknown/actual)
- OpenRouter live pricing via models API
- Schema migration v4→v5 with billing metadata columns
- Removed speculative forward-looking entries
- Removed cost display from CLI status bar
- Threaded OpenRouter metadata pre-warm

Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>

2026-03-17 03:44:44 -07:00

14 KiB

Raw Permalink Blame History

Pricing Accuracy Architecture

Date: 2026-03-16

Goal

Hermes should only show dollar costs when they are backed by an official source for the user's actual billing path.

This design replaces the current static, heuristic pricing flow in:

run_agent.py
agent/usage_pricing.py
agent/insights.py
cli.py

with a provider-aware pricing system that:

handles cache billing correctly
distinguishes actual vs estimated vs included vs unknown
reconciles post-hoc costs when providers expose authoritative billing data
supports direct providers, OpenRouter, subscriptions, enterprise pricing, and custom endpoints

Problems In The Current Design

Current Hermes behavior has four structural issues:

It stores only prompt_tokens and completion_tokens, which is insufficient for providers that bill cache reads and cache writes separately.
It uses a static model price table and fuzzy heuristics, which can drift from current official pricing.
It assumes public API list pricing matches the user's real billing path.
It has no distinction between live estimates and reconciled billed cost.

Design Principles

Normalize usage before pricing.
Never fold cached tokens into plain input cost.
Track certainty explicitly.
Treat the billing path as part of the model identity.
Prefer official machine-readable sources over scraped docs.
Use post-hoc provider cost APIs when available.
Show n/a rather than inventing precision.

High-Level Architecture

The new system has four layers:

usage_normalization Converts raw provider usage into a canonical usage record.
pricing_source_resolution Determines the billing path, source of truth, and applicable pricing source.
cost_estimation_and_reconciliation Produces an immediate estimate when possible, then replaces or annotates it with actual billed cost later.
presentation /usage, /insights, and the status bar display cost with certainty metadata.

Canonical Usage Record

Add a canonical usage model that every provider path maps into before any pricing math happens.

Suggested structure:

@dataclass
class CanonicalUsage:
    provider: str
    billing_provider: str
    model: str
    billing_route: str

    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    reasoning_tokens: int = 0
    request_count: int = 1

    raw_usage: dict[str, Any] | None = None
    raw_usage_fields: dict[str, str] | None = None
    computed_fields: set[str] | None = None

    provider_request_id: str | None = None
    provider_generation_id: str | None = None
    provider_response_id: str | None = None

Rules:

input_tokens means non-cached input only.
cache_read_tokens and cache_write_tokens are never merged into input_tokens.
output_tokens excludes cache metrics.
reasoning_tokens is telemetry unless a provider officially bills it separately.

This is the same normalization pattern used by opencode, extended with provenance and reconciliation ids.

Provider Normalization Rules

OpenAI Direct

Source usage fields:

prompt_tokens
completion_tokens
prompt_tokens_details.cached_tokens

Normalization:

cache_read_tokens = cached_tokens
input_tokens = prompt_tokens - cached_tokens
cache_write_tokens = 0 unless OpenAI exposes it in the relevant route
output_tokens = completion_tokens

Anthropic Direct

Source usage fields:

input_tokens
output_tokens
cache_read_input_tokens
cache_creation_input_tokens

Normalization:

input_tokens = input_tokens
output_tokens = output_tokens
cache_read_tokens = cache_read_input_tokens
cache_write_tokens = cache_creation_input_tokens

OpenRouter

Estimate-time usage normalization should use the response usage payload with the same rules as the underlying provider when possible.

Reconciliation-time records should also store:

OpenRouter generation id
native token fields when available
total_cost
cache_discount
upstream_inference_cost
is_byok

Gemini / Vertex

Use official Gemini or Vertex usage fields where available.

If cached content tokens are exposed:

map them to cache_read_tokens

If a route exposes no cache creation metric:

store cache_write_tokens = 0
preserve the raw usage payload for later extension

DeepSeek And Other Direct Providers

Normalize only the fields that are officially exposed.

If a provider does not expose cache buckets:

do not infer them unless the provider explicitly documents how to derive them

Subscription / Included-Cost Routes

These still use the canonical usage model.

Tokens are tracked normally. Cost depends on billing mode, not on whether usage exists.

Billing Route Model

Hermes must stop keying pricing solely by model.

Introduce a billing route descriptor:

@dataclass
class BillingRoute:
    provider: str
    base_url: str | None
    model: str
    billing_mode: str
    organization_hint: str | None = None

billing_mode values:

official_cost_api
official_generation_api
official_models_api
official_docs_snapshot
subscription_included
user_override
custom_contract
unknown

Examples:

OpenAI direct API with Costs API access: official_cost_api
Anthropic direct API with Usage & Cost API access: official_cost_api
OpenRouter request before reconciliation: official_models_api
OpenRouter request after generation lookup: official_generation_api
GitHub Copilot style subscription route: subscription_included
local OpenAI-compatible server: unknown
enterprise contract with configured rates: custom_contract

Cost Status Model

Every displayed cost should have:

@dataclass
class CostResult:
    amount_usd: Decimal | None
    status: Literal["actual", "estimated", "included", "unknown"]
    source: Literal[
        "provider_cost_api",
        "provider_generation_api",
        "provider_models_api",
        "official_docs_snapshot",
        "user_override",
        "custom_contract",
        "none",
    ]
    label: str
    fetched_at: datetime | None
    pricing_version: str | None
    notes: list[str]

Presentation rules:

actual: show dollar amount as final
estimated: show dollar amount with estimate labeling
included: show included or $0.00 (included) depending on UX choice
unknown: show n/a

Official Source Hierarchy

Resolve cost using this order:

Request-level or account-level official billed cost
Official machine-readable model pricing
Official docs snapshot
User override or custom contract
Unknown

The system must never skip to a lower level if a higher-confidence source exists for the current billing route.

Provider-Specific Truth Rules

OpenAI Direct

Preferred truth:

Costs API for reconciled spend
Official pricing page for live estimate

Anthropic Direct

Preferred truth:

Usage & Cost API for reconciled spend
Official pricing docs for live estimate

OpenRouter

Preferred truth:

GET /api/v1/generation for reconciled total_cost
GET /api/v1/models pricing for live estimate

Do not use underlying provider public pricing as the source of truth for OpenRouter billing.

Gemini / Vertex

Preferred truth:

official billing export or billing API for reconciled spend when available for the route
official pricing docs for estimate

DeepSeek

Preferred truth:

official machine-readable cost source if available in the future
official pricing docs snapshot today

Subscription-Included Routes

Preferred truth:

explicit route config marking the model as included in subscription

These should display included, not an API list-price estimate.

Custom Endpoint / Local Model

Preferred truth:

user override
custom contract config
unknown

These should default to unknown.

Pricing Catalog

Replace the current MODEL_PRICING dict with a richer pricing catalog.

Suggested record:

@dataclass
class PricingEntry:
    provider: str
    route_pattern: str
    model_pattern: str

    input_cost_per_million: Decimal | None = None
    output_cost_per_million: Decimal | None = None
    cache_read_cost_per_million: Decimal | None = None
    cache_write_cost_per_million: Decimal | None = None
    request_cost: Decimal | None = None
    image_cost: Decimal | None = None

    source: str = "official_docs_snapshot"
    source_url: str | None = None
    fetched_at: datetime | None = None
    pricing_version: str | None = None

The catalog should be route-aware:

openai:gpt-5
anthropic:claude-opus-4-6
openrouter:anthropic/claude-opus-4.6
copilot:gpt-4o

This avoids conflating direct-provider billing with aggregator billing.

Pricing Sync Architecture

Introduce a pricing sync subsystem instead of manually maintaining a single hardcoded table.

Suggested modules:

agent/pricing/catalog.py
agent/pricing/sources.py
agent/pricing/sync.py
agent/pricing/reconcile.py
agent/pricing/types.py

Sync Sources

OpenRouter models API
official provider docs snapshots where no API exists
user overrides from config

Sync Output

Cache pricing entries locally with:

source URL
fetch timestamp
version/hash
confidence/source type

Sync Frequency

startup warm cache
background refresh every 6 to 24 hours depending on source
manual hermes pricing sync

Reconciliation Architecture

Live requests may produce only an estimate initially. Hermes should reconcile them later when a provider exposes actual billed cost.

Suggested flow:

Agent call completes.
Hermes stores canonical usage plus reconciliation ids.
Hermes computes an immediate estimate if a pricing source exists.
A reconciliation worker fetches actual cost when supported.
Session and message records are updated with actual cost.

This can run:

inline for cheap lookups
asynchronously for delayed provider accounting

Persistence Changes

Session storage should stop storing only aggregate prompt/completion totals.

Add fields for both usage and cost certainty:

input_tokens
output_tokens
cache_read_tokens
cache_write_tokens
reasoning_tokens
estimated_cost_usd
actual_cost_usd
cost_status
cost_source
pricing_version
billing_provider
billing_mode

If schema expansion is too large for one PR, add a new pricing events table:

session_cost_events
  id
  session_id
  request_id
  provider
  model
  billing_mode
  input_tokens
  output_tokens
  cache_read_tokens
  cache_write_tokens
  estimated_cost_usd
  actual_cost_usd
  cost_status
  cost_source
  pricing_version
  created_at
  updated_at

Hermes Touchpoints

`run_agent.py`

Current responsibility:

parse raw provider usage
update session token counters

New responsibility:

build CanonicalUsage
update canonical counters
store reconciliation ids
emit usage event to pricing subsystem

`agent/usage_pricing.py`

Current responsibility:

static lookup table
direct cost arithmetic

New responsibility:

move or replace with pricing catalog facade
no fuzzy model-family heuristics
no direct pricing without billing-route context

`cli.py`

Current responsibility:

compute session cost directly from prompt/completion totals

New responsibility:

display CostResult
show status badges:
- actual
- estimated
- included
- n/a

`agent/insights.py`

Current responsibility:

recompute historical estimates from static pricing

New responsibility:

aggregate stored pricing events
prefer actual cost over estimate
surface estimates only when reconciliation is unavailable

UX Rules

Status Bar

Show one of:

$1.42
~$1.42
included
cost n/a

Where:

$1.42 means actual
~$1.42 means estimated
included means subscription-backed or explicitly zero-cost route
cost n/a means unknown

`/usage`

Show:

token buckets
estimated cost
actual cost if available
cost status
pricing source

`/insights`

Aggregate:

actual cost totals
estimated-only totals
unknown-cost sessions count
included-cost sessions count

Config And Overrides

Add user-configurable pricing overrides in config:

pricing:
  mode: hybrid
  sync_on_startup: true
  sync_interval_hours: 12
  overrides:
    - provider: openrouter
      model: anthropic/claude-opus-4.6
      billing_mode: custom_contract
      input_cost_per_million: 4.25
      output_cost_per_million: 22.0
      cache_read_cost_per_million: 0.5
      cache_write_cost_per_million: 6.0
  included_routes:
    - provider: copilot
      model: "*"
    - provider: codex-subscription
      model: "*"

Overrides must win over catalog defaults for the matching billing route.

Rollout Plan

Phase 1

add canonical usage model
split cache token buckets in run_agent.py
stop pricing cache-inflated prompt totals
preserve current UI with improved backend math

Phase 2

add route-aware pricing catalog
integrate OpenRouter models API sync
add estimated vs included vs unknown

Phase 3

add reconciliation for OpenRouter generation cost
add actual cost persistence
update /insights to prefer actual cost

Phase 4

add direct OpenAI and Anthropic reconciliation paths
add user overrides and contract pricing
add pricing sync CLI command

Testing Strategy

Add tests for:

OpenAI cached token subtraction
Anthropic cache read/write separation
OpenRouter estimated vs actual reconciliation
subscription-backed models showing included
custom endpoints showing n/a
override precedence
stale catalog fallback behavior

Current tests that assume heuristic pricing should be replaced with route-aware expectations.

Non-Goals

exact enterprise billing reconstruction without an official source or user override
backfilling perfect historical cost for old sessions that lack cache bucket data
scraping arbitrary provider web pages at request time

Recommendation

Do not expand the existing MODEL_PRICING dict.

That path cannot satisfy the product requirement. Hermes should instead migrate to:

canonical usage normalization
route-aware pricing sources
estimate-then-reconcile cost lifecycle
explicit certainty states in the UI

This is the minimum architecture that makes the statement "Hermes pricing is backed by official sources where possible, and otherwise clearly labeled" defensible.

14 KiB Raw Permalink Blame History

Pricing Accuracy Architecture

Goal

Problems In The Current Design

Design Principles

High-Level Architecture

Canonical Usage Record

Provider Normalization Rules

OpenAI Direct

Anthropic Direct

OpenRouter

Gemini / Vertex

DeepSeek And Other Direct Providers

Subscription / Included-Cost Routes

Billing Route Model

Cost Status Model

Official Source Hierarchy

Provider-Specific Truth Rules

OpenAI Direct

Anthropic Direct

OpenRouter

Gemini / Vertex

DeepSeek

Subscription-Included Routes

Custom Endpoint / Local Model

Pricing Catalog

Pricing Sync Architecture

Sync Sources

Sync Output

Sync Frequency

Reconciliation Architecture

Persistence Changes

Hermes Touchpoints

run_agent.py

agent/usage_pricing.py

cli.py

agent/insights.py

UX Rules

Status Bar

/usage

/insights

Config And Overrides

Rollout Plan

Phase 1

Phase 2

Phase 3

Phase 4

Testing Strategy

Non-Goals

Recommendation

14 KiB

Raw Permalink Blame History

`run_agent.py`

`agent/usage_pricing.py`

`cli.py`

`agent/insights.py`

`/usage`

`/insights`