feat: add route-aware pricing estimates (#1695)
Salvaged from PR #1563 by @kshitijk4poor. Cherry-picked with authorship preserved. - Route-aware pricing architecture replacing static MODEL_PRICING + heuristics - Canonical usage normalization (Anthropic/OpenAI/Codex API shapes) - Cache-aware billing (separate cache_read/cache_write rates) - Cost status tracking (estimated/included/unknown/actual) - OpenRouter live pricing via models API - Schema migration v4→v5 with billing metadata columns - Removed speculative forward-looking entries - Removed cost display from CLI status bar - Threaded OpenRouter metadata pre-warm Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
This commit is contained in:
608
docs/plans/2026-03-16-pricing-accuracy-architecture-design.md
Normal file
608
docs/plans/2026-03-16-pricing-accuracy-architecture-design.md
Normal file
@@ -0,0 +1,608 @@
|
||||
# Pricing Accuracy Architecture
|
||||
|
||||
Date: 2026-03-16
|
||||
|
||||
## Goal
|
||||
|
||||
Hermes should only show dollar costs when they are backed by an official source for the user's actual billing path.
|
||||
|
||||
This design replaces the current static, heuristic pricing flow in:
|
||||
|
||||
- `run_agent.py`
|
||||
- `agent/usage_pricing.py`
|
||||
- `agent/insights.py`
|
||||
- `cli.py`
|
||||
|
||||
with a provider-aware pricing system that:
|
||||
|
||||
- handles cache billing correctly
|
||||
- distinguishes `actual` vs `estimated` vs `included` vs `unknown`
|
||||
- reconciles post-hoc costs when providers expose authoritative billing data
|
||||
- supports direct providers, OpenRouter, subscriptions, enterprise pricing, and custom endpoints
|
||||
|
||||
## Problems In The Current Design
|
||||
|
||||
Current Hermes behavior has four structural issues:
|
||||
|
||||
1. It stores only `prompt_tokens` and `completion_tokens`, which is insufficient for providers that bill cache reads and cache writes separately.
|
||||
2. It uses a static model price table and fuzzy heuristics, which can drift from current official pricing.
|
||||
3. It assumes public API list pricing matches the user's real billing path.
|
||||
4. It has no distinction between live estimates and reconciled billed cost.
|
||||
|
||||
## Design Principles
|
||||
|
||||
1. Normalize usage before pricing.
|
||||
2. Never fold cached tokens into plain input cost.
|
||||
3. Track certainty explicitly.
|
||||
4. Treat the billing path as part of the model identity.
|
||||
5. Prefer official machine-readable sources over scraped docs.
|
||||
6. Use post-hoc provider cost APIs when available.
|
||||
7. Show `n/a` rather than inventing precision.
|
||||
|
||||
## High-Level Architecture
|
||||
|
||||
The new system has four layers:
|
||||
|
||||
1. `usage_normalization`
|
||||
Converts raw provider usage into a canonical usage record.
|
||||
2. `pricing_source_resolution`
|
||||
Determines the billing path, source of truth, and applicable pricing source.
|
||||
3. `cost_estimation_and_reconciliation`
|
||||
Produces an immediate estimate when possible, then replaces or annotates it with actual billed cost later.
|
||||
4. `presentation`
|
||||
`/usage`, `/insights`, and the status bar display cost with certainty metadata.
|
||||
|
||||
## Canonical Usage Record
|
||||
|
||||
Add a canonical usage model that every provider path maps into before any pricing math happens.
|
||||
|
||||
Suggested structure:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class CanonicalUsage:
|
||||
provider: str
|
||||
billing_provider: str
|
||||
model: str
|
||||
billing_route: str
|
||||
|
||||
input_tokens: int = 0
|
||||
output_tokens: int = 0
|
||||
cache_read_tokens: int = 0
|
||||
cache_write_tokens: int = 0
|
||||
reasoning_tokens: int = 0
|
||||
request_count: int = 1
|
||||
|
||||
raw_usage: dict[str, Any] | None = None
|
||||
raw_usage_fields: dict[str, str] | None = None
|
||||
computed_fields: set[str] | None = None
|
||||
|
||||
provider_request_id: str | None = None
|
||||
provider_generation_id: str | None = None
|
||||
provider_response_id: str | None = None
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- `input_tokens` means non-cached input only.
|
||||
- `cache_read_tokens` and `cache_write_tokens` are never merged into `input_tokens`.
|
||||
- `output_tokens` excludes cache metrics.
|
||||
- `reasoning_tokens` is telemetry unless a provider officially bills it separately.
|
||||
|
||||
This is the same normalization pattern used by `opencode`, extended with provenance and reconciliation ids.
|
||||
|
||||
## Provider Normalization Rules
|
||||
|
||||
### OpenAI Direct
|
||||
|
||||
Source usage fields:
|
||||
|
||||
- `prompt_tokens`
|
||||
- `completion_tokens`
|
||||
- `prompt_tokens_details.cached_tokens`
|
||||
|
||||
Normalization:
|
||||
|
||||
- `cache_read_tokens = cached_tokens`
|
||||
- `input_tokens = prompt_tokens - cached_tokens`
|
||||
- `cache_write_tokens = 0` unless OpenAI exposes it in the relevant route
|
||||
- `output_tokens = completion_tokens`
|
||||
|
||||
### Anthropic Direct
|
||||
|
||||
Source usage fields:
|
||||
|
||||
- `input_tokens`
|
||||
- `output_tokens`
|
||||
- `cache_read_input_tokens`
|
||||
- `cache_creation_input_tokens`
|
||||
|
||||
Normalization:
|
||||
|
||||
- `input_tokens = input_tokens`
|
||||
- `output_tokens = output_tokens`
|
||||
- `cache_read_tokens = cache_read_input_tokens`
|
||||
- `cache_write_tokens = cache_creation_input_tokens`
|
||||
|
||||
### OpenRouter
|
||||
|
||||
Estimate-time usage normalization should use the response usage payload with the same rules as the underlying provider when possible.
|
||||
|
||||
Reconciliation-time records should also store:
|
||||
|
||||
- OpenRouter generation id
|
||||
- native token fields when available
|
||||
- `total_cost`
|
||||
- `cache_discount`
|
||||
- `upstream_inference_cost`
|
||||
- `is_byok`
|
||||
|
||||
### Gemini / Vertex
|
||||
|
||||
Use official Gemini or Vertex usage fields where available.
|
||||
|
||||
If cached content tokens are exposed:
|
||||
|
||||
- map them to `cache_read_tokens`
|
||||
|
||||
If a route exposes no cache creation metric:
|
||||
|
||||
- store `cache_write_tokens = 0`
|
||||
- preserve the raw usage payload for later extension
|
||||
|
||||
### DeepSeek And Other Direct Providers
|
||||
|
||||
Normalize only the fields that are officially exposed.
|
||||
|
||||
If a provider does not expose cache buckets:
|
||||
|
||||
- do not infer them unless the provider explicitly documents how to derive them
|
||||
|
||||
### Subscription / Included-Cost Routes
|
||||
|
||||
These still use the canonical usage model.
|
||||
|
||||
Tokens are tracked normally. Cost depends on billing mode, not on whether usage exists.
|
||||
|
||||
## Billing Route Model
|
||||
|
||||
Hermes must stop keying pricing solely by `model`.
|
||||
|
||||
Introduce a billing route descriptor:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class BillingRoute:
|
||||
provider: str
|
||||
base_url: str | None
|
||||
model: str
|
||||
billing_mode: str
|
||||
organization_hint: str | None = None
|
||||
```
|
||||
|
||||
`billing_mode` values:
|
||||
|
||||
- `official_cost_api`
|
||||
- `official_generation_api`
|
||||
- `official_models_api`
|
||||
- `official_docs_snapshot`
|
||||
- `subscription_included`
|
||||
- `user_override`
|
||||
- `custom_contract`
|
||||
- `unknown`
|
||||
|
||||
Examples:
|
||||
|
||||
- OpenAI direct API with Costs API access: `official_cost_api`
|
||||
- Anthropic direct API with Usage & Cost API access: `official_cost_api`
|
||||
- OpenRouter request before reconciliation: `official_models_api`
|
||||
- OpenRouter request after generation lookup: `official_generation_api`
|
||||
- GitHub Copilot style subscription route: `subscription_included`
|
||||
- local OpenAI-compatible server: `unknown`
|
||||
- enterprise contract with configured rates: `custom_contract`
|
||||
|
||||
## Cost Status Model
|
||||
|
||||
Every displayed cost should have:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class CostResult:
|
||||
amount_usd: Decimal | None
|
||||
status: Literal["actual", "estimated", "included", "unknown"]
|
||||
source: Literal[
|
||||
"provider_cost_api",
|
||||
"provider_generation_api",
|
||||
"provider_models_api",
|
||||
"official_docs_snapshot",
|
||||
"user_override",
|
||||
"custom_contract",
|
||||
"none",
|
||||
]
|
||||
label: str
|
||||
fetched_at: datetime | None
|
||||
pricing_version: str | None
|
||||
notes: list[str]
|
||||
```
|
||||
|
||||
Presentation rules:
|
||||
|
||||
- `actual`: show dollar amount as final
|
||||
- `estimated`: show dollar amount with estimate labeling
|
||||
- `included`: show `included` or `$0.00 (included)` depending on UX choice
|
||||
- `unknown`: show `n/a`
|
||||
|
||||
## Official Source Hierarchy
|
||||
|
||||
Resolve cost using this order:
|
||||
|
||||
1. Request-level or account-level official billed cost
|
||||
2. Official machine-readable model pricing
|
||||
3. Official docs snapshot
|
||||
4. User override or custom contract
|
||||
5. Unknown
|
||||
|
||||
The system must never skip to a lower level if a higher-confidence source exists for the current billing route.
|
||||
|
||||
## Provider-Specific Truth Rules
|
||||
|
||||
### OpenAI Direct
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. Costs API for reconciled spend
|
||||
2. Official pricing page for live estimate
|
||||
|
||||
### Anthropic Direct
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. Usage & Cost API for reconciled spend
|
||||
2. Official pricing docs for live estimate
|
||||
|
||||
### OpenRouter
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. `GET /api/v1/generation` for reconciled `total_cost`
|
||||
2. `GET /api/v1/models` pricing for live estimate
|
||||
|
||||
Do not use underlying provider public pricing as the source of truth for OpenRouter billing.
|
||||
|
||||
### Gemini / Vertex
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. official billing export or billing API for reconciled spend when available for the route
|
||||
2. official pricing docs for estimate
|
||||
|
||||
### DeepSeek
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. official machine-readable cost source if available in the future
|
||||
2. official pricing docs snapshot today
|
||||
|
||||
### Subscription-Included Routes
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. explicit route config marking the model as included in subscription
|
||||
|
||||
These should display `included`, not an API list-price estimate.
|
||||
|
||||
### Custom Endpoint / Local Model
|
||||
|
||||
Preferred truth:
|
||||
|
||||
1. user override
|
||||
2. custom contract config
|
||||
3. unknown
|
||||
|
||||
These should default to `unknown`.
|
||||
|
||||
## Pricing Catalog
|
||||
|
||||
Replace the current `MODEL_PRICING` dict with a richer pricing catalog.
|
||||
|
||||
Suggested record:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PricingEntry:
|
||||
provider: str
|
||||
route_pattern: str
|
||||
model_pattern: str
|
||||
|
||||
input_cost_per_million: Decimal | None = None
|
||||
output_cost_per_million: Decimal | None = None
|
||||
cache_read_cost_per_million: Decimal | None = None
|
||||
cache_write_cost_per_million: Decimal | None = None
|
||||
request_cost: Decimal | None = None
|
||||
image_cost: Decimal | None = None
|
||||
|
||||
source: str = "official_docs_snapshot"
|
||||
source_url: str | None = None
|
||||
fetched_at: datetime | None = None
|
||||
pricing_version: str | None = None
|
||||
```
|
||||
|
||||
The catalog should be route-aware:
|
||||
|
||||
- `openai:gpt-5`
|
||||
- `anthropic:claude-opus-4-6`
|
||||
- `openrouter:anthropic/claude-opus-4.6`
|
||||
- `copilot:gpt-4o`
|
||||
|
||||
This avoids conflating direct-provider billing with aggregator billing.
|
||||
|
||||
## Pricing Sync Architecture
|
||||
|
||||
Introduce a pricing sync subsystem instead of manually maintaining a single hardcoded table.
|
||||
|
||||
Suggested modules:
|
||||
|
||||
- `agent/pricing/catalog.py`
|
||||
- `agent/pricing/sources.py`
|
||||
- `agent/pricing/sync.py`
|
||||
- `agent/pricing/reconcile.py`
|
||||
- `agent/pricing/types.py`
|
||||
|
||||
### Sync Sources
|
||||
|
||||
- OpenRouter models API
|
||||
- official provider docs snapshots where no API exists
|
||||
- user overrides from config
|
||||
|
||||
### Sync Output
|
||||
|
||||
Cache pricing entries locally with:
|
||||
|
||||
- source URL
|
||||
- fetch timestamp
|
||||
- version/hash
|
||||
- confidence/source type
|
||||
|
||||
### Sync Frequency
|
||||
|
||||
- startup warm cache
|
||||
- background refresh every 6 to 24 hours depending on source
|
||||
- manual `hermes pricing sync`
|
||||
|
||||
## Reconciliation Architecture
|
||||
|
||||
Live requests may produce only an estimate initially. Hermes should reconcile them later when a provider exposes actual billed cost.
|
||||
|
||||
Suggested flow:
|
||||
|
||||
1. Agent call completes.
|
||||
2. Hermes stores canonical usage plus reconciliation ids.
|
||||
3. Hermes computes an immediate estimate if a pricing source exists.
|
||||
4. A reconciliation worker fetches actual cost when supported.
|
||||
5. Session and message records are updated with `actual` cost.
|
||||
|
||||
This can run:
|
||||
|
||||
- inline for cheap lookups
|
||||
- asynchronously for delayed provider accounting
|
||||
|
||||
## Persistence Changes
|
||||
|
||||
Session storage should stop storing only aggregate prompt/completion totals.
|
||||
|
||||
Add fields for both usage and cost certainty:
|
||||
|
||||
- `input_tokens`
|
||||
- `output_tokens`
|
||||
- `cache_read_tokens`
|
||||
- `cache_write_tokens`
|
||||
- `reasoning_tokens`
|
||||
- `estimated_cost_usd`
|
||||
- `actual_cost_usd`
|
||||
- `cost_status`
|
||||
- `cost_source`
|
||||
- `pricing_version`
|
||||
- `billing_provider`
|
||||
- `billing_mode`
|
||||
|
||||
If schema expansion is too large for one PR, add a new pricing events table:
|
||||
|
||||
```text
|
||||
session_cost_events
|
||||
id
|
||||
session_id
|
||||
request_id
|
||||
provider
|
||||
model
|
||||
billing_mode
|
||||
input_tokens
|
||||
output_tokens
|
||||
cache_read_tokens
|
||||
cache_write_tokens
|
||||
estimated_cost_usd
|
||||
actual_cost_usd
|
||||
cost_status
|
||||
cost_source
|
||||
pricing_version
|
||||
created_at
|
||||
updated_at
|
||||
```
|
||||
|
||||
## Hermes Touchpoints
|
||||
|
||||
### `run_agent.py`
|
||||
|
||||
Current responsibility:
|
||||
|
||||
- parse raw provider usage
|
||||
- update session token counters
|
||||
|
||||
New responsibility:
|
||||
|
||||
- build `CanonicalUsage`
|
||||
- update canonical counters
|
||||
- store reconciliation ids
|
||||
- emit usage event to pricing subsystem
|
||||
|
||||
### `agent/usage_pricing.py`
|
||||
|
||||
Current responsibility:
|
||||
|
||||
- static lookup table
|
||||
- direct cost arithmetic
|
||||
|
||||
New responsibility:
|
||||
|
||||
- move or replace with pricing catalog facade
|
||||
- no fuzzy model-family heuristics
|
||||
- no direct pricing without billing-route context
|
||||
|
||||
### `cli.py`
|
||||
|
||||
Current responsibility:
|
||||
|
||||
- compute session cost directly from prompt/completion totals
|
||||
|
||||
New responsibility:
|
||||
|
||||
- display `CostResult`
|
||||
- show status badges:
|
||||
- `actual`
|
||||
- `estimated`
|
||||
- `included`
|
||||
- `n/a`
|
||||
|
||||
### `agent/insights.py`
|
||||
|
||||
Current responsibility:
|
||||
|
||||
- recompute historical estimates from static pricing
|
||||
|
||||
New responsibility:
|
||||
|
||||
- aggregate stored pricing events
|
||||
- prefer actual cost over estimate
|
||||
- surface estimates only when reconciliation is unavailable
|
||||
|
||||
## UX Rules
|
||||
|
||||
### Status Bar
|
||||
|
||||
Show one of:
|
||||
|
||||
- `$1.42`
|
||||
- `~$1.42`
|
||||
- `included`
|
||||
- `cost n/a`
|
||||
|
||||
Where:
|
||||
|
||||
- `$1.42` means `actual`
|
||||
- `~$1.42` means `estimated`
|
||||
- `included` means subscription-backed or explicitly zero-cost route
|
||||
- `cost n/a` means unknown
|
||||
|
||||
### `/usage`
|
||||
|
||||
Show:
|
||||
|
||||
- token buckets
|
||||
- estimated cost
|
||||
- actual cost if available
|
||||
- cost status
|
||||
- pricing source
|
||||
|
||||
### `/insights`
|
||||
|
||||
Aggregate:
|
||||
|
||||
- actual cost totals
|
||||
- estimated-only totals
|
||||
- unknown-cost sessions count
|
||||
- included-cost sessions count
|
||||
|
||||
## Config And Overrides
|
||||
|
||||
Add user-configurable pricing overrides in config:
|
||||
|
||||
```yaml
|
||||
pricing:
|
||||
mode: hybrid
|
||||
sync_on_startup: true
|
||||
sync_interval_hours: 12
|
||||
overrides:
|
||||
- provider: openrouter
|
||||
model: anthropic/claude-opus-4.6
|
||||
billing_mode: custom_contract
|
||||
input_cost_per_million: 4.25
|
||||
output_cost_per_million: 22.0
|
||||
cache_read_cost_per_million: 0.5
|
||||
cache_write_cost_per_million: 6.0
|
||||
included_routes:
|
||||
- provider: copilot
|
||||
model: "*"
|
||||
- provider: codex-subscription
|
||||
model: "*"
|
||||
```
|
||||
|
||||
Overrides must win over catalog defaults for the matching billing route.
|
||||
|
||||
## Rollout Plan
|
||||
|
||||
### Phase 1
|
||||
|
||||
- add canonical usage model
|
||||
- split cache token buckets in `run_agent.py`
|
||||
- stop pricing cache-inflated prompt totals
|
||||
- preserve current UI with improved backend math
|
||||
|
||||
### Phase 2
|
||||
|
||||
- add route-aware pricing catalog
|
||||
- integrate OpenRouter models API sync
|
||||
- add `estimated` vs `included` vs `unknown`
|
||||
|
||||
### Phase 3
|
||||
|
||||
- add reconciliation for OpenRouter generation cost
|
||||
- add actual cost persistence
|
||||
- update `/insights` to prefer actual cost
|
||||
|
||||
### Phase 4
|
||||
|
||||
- add direct OpenAI and Anthropic reconciliation paths
|
||||
- add user overrides and contract pricing
|
||||
- add pricing sync CLI command
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
Add tests for:
|
||||
|
||||
- OpenAI cached token subtraction
|
||||
- Anthropic cache read/write separation
|
||||
- OpenRouter estimated vs actual reconciliation
|
||||
- subscription-backed models showing `included`
|
||||
- custom endpoints showing `n/a`
|
||||
- override precedence
|
||||
- stale catalog fallback behavior
|
||||
|
||||
Current tests that assume heuristic pricing should be replaced with route-aware expectations.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- exact enterprise billing reconstruction without an official source or user override
|
||||
- backfilling perfect historical cost for old sessions that lack cache bucket data
|
||||
- scraping arbitrary provider web pages at request time
|
||||
|
||||
## Recommendation
|
||||
|
||||
Do not expand the existing `MODEL_PRICING` dict.
|
||||
|
||||
That path cannot satisfy the product requirement. Hermes should instead migrate to:
|
||||
|
||||
- canonical usage normalization
|
||||
- route-aware pricing sources
|
||||
- estimate-then-reconcile cost lifecycle
|
||||
- explicit certainty states in the UI
|
||||
|
||||
This is the minimum architecture that makes the statement "Hermes pricing is backed by official sources where possible, and otherwise clearly labeled" defensible.
|
||||
Reference in New Issue
Block a user