timmy-tower/TIMMY_TEST_PLAN.md
Alexander Whitestone 9aaf5b3ab7 feat: testkit T37-T39 for Timmy economic peer + fix mobile typecheck (Refs #45)
Add testkit coverage for the Timmy economic peer endpoints (issue #45):
- T37: GET /api/identity/timmy — npub, pubkeyHex, zapCount shape assertions
- T38/T39: POST /api/identity/vouch auth guards (no token / invalid token → 401)
- Update TIMMY_TEST_PLAN.md with new test table rows
- Renumber former FUTURE T37-T40 stubs to T40-T43

Fix pre-existing mobile typecheck error: move slideStyles declaration before
the slides array in artifacts/mobile/app/onboarding.tsx to resolve TS2448/2454.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 16:19:23 -04:00


Timmy API — Test Plan & Report Prompt

What is Timmy? Timmy is a Lightning Network-gated AI agent API. Users pay in Bitcoin over Lightning to submit requests to an AI agent (Claude). There are two payment modes:

  • Mode 1 — Per-Job (live): Pay per request. Eval invoice (10 sats fixed) → Haiku judges the request → work invoice (dynamic, token-based) → Sonnet executes → result delivered.
  • Mode 2 — Session (live): Pre-fund a credit balance. Requests automatically debit actual compute cost (eval + work tokens × 1.4 margin, converted to sats at live BTC/USD). No per-job invoices once active.

Live base URL:

https://9f85e954-647c-46a5-90a7-396e495a805a-00-clz2vhmfuk7p.spock.replit.dev

Running the full test suite — one command

curl -s https://9f85e954-647c-46a5-90a7-396e495a805a-00-clz2vhmfuk7p.spock.replit.dev/api/testkit | bash

The server returns a self-contained bash script with the base URL already baked in. Requirements: curl, bash, jq — nothing else.

Note for repeat runs: Tests 7 and 8c hit GET /api/demo, which is rate-limited to 5 req/hr per IP. If you run the testkit more than once in the same hour from the same IP, those two checks will return 429. This is expected behaviour — the rate limiter is working correctly. Run from a fresh IP (or wait an hour) for a clean 20/20.
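The 429s described above follow from a per-IP request budget. As a rough illustration only, here is a minimal sketch of an in-memory limiter, assuming a fixed one-hour window (the plan states 5 req/hr per IP but not the windowing strategy). The class name `FixedWindowLimiter` and its methods are hypothetical and not taken from the Timmy codebase.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Hypothetical in-memory limiter: at most `limit` requests per window per key."""

    def __init__(self, limit: int = 5, window_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # key (e.g. client IP) -> (window start time, requests counted in that window)
        self.windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        """Return True if the request may proceed, False if it should get HTTP 429."""
        now = self.clock()
        start, count = self.windows.get(key, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window elapsed: start a fresh one
        if count >= self.limit:
            self.windows[key] = (start, count)
            return False
        self.windows[key] = (start, count + 1)
        return True
```

Under this assumption, tests 7 and 8c each consume one slot of the budget, so a second testkit run within the same hour from the same IP can exhaust it before those checks are reached.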


What the testkit covers

Mode 1 — Per-Job (tests 1-10)

| # | Name | What it checks |
|---|------|----------------|
| 1 | Health check | GET /api/healthz → HTTP 200, status=ok |
| 2 | Create job | POST /api/jobs → HTTP 201, jobId + evalInvoice.amountSats=10 |
| 3 | Poll before payment | GET /api/jobs/:id → state=awaiting_eval_payment, invoice echoed, paymentHash present in stub mode |
| 4 | Pay eval invoice | POST /api/dev/stub/pay/:hash → {"ok":true} |
| 5 | Eval state advance | Polls until state=awaiting_work_payment OR state=rejected (30s timeout) |
| 6 | Pay work + get result | Pays work invoice, polls until state=complete, result non-empty (30s timeout) |
| 7 | Demo endpoint | GET /api/demo?request=... → HTTP 200, coherent result |
| 8a | Missing body | POST /api/jobs {} → HTTP 400 |
| 8b | Unknown job ID | GET /api/jobs/does-not-exist → HTTP 404 |
| 8c | Demo missing param | GET /api/demo → HTTP 400 |
| 8d | 501-char request | POST /api/jobs with 501 chars → HTTP 400 mentioning "500 characters" |
| 9 | Rate limiter | 6× GET /api/demo → at least one HTTP 429 |
| 10 | Rejection path | Adversarial request goes through eval, polls until state=rejected with a non-empty reason |
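The happy path in tests 2 to 6 can be pictured as one small driver loop. The sketch below is illustrative and makes several assumptions that the plan does not confirm: the POST body field is named `request`, and the pending invoice's hash appears as a top-level `paymentHash` on the GET response. The transport is injected as a hypothetical `call(method, path, body)` function, so no particular HTTP library is implied.

```python
from typing import Any, Callable, Dict, Optional

# call(method, path, body) -> parsed JSON response (hypothetical transport hook)
Call = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def run_job(call: Call, request_text: str, max_polls: int = 30) -> Dict[str, Any]:
    """Drive one job to a terminal state, paying invoices through the dev stub."""
    # Assumption: the request text is sent in a field called "request".
    job = call("POST", "/api/jobs", {"request": request_text})
    job_id = job["jobId"]
    paid = set()  # payment hashes already settled, so each invoice is paid once
    for _ in range(max_polls):
        status = call("GET", f"/api/jobs/{job_id}", None)
        state = status["state"]
        if state in ("complete", "rejected"):
            return status
        if state in ("awaiting_eval_payment", "awaiting_work_payment"):
            # Assumption: stub mode exposes the pending hash as "paymentHash".
            payment_hash = status["paymentHash"]
            if payment_hash not in paid:
                call("POST", f"/api/dev/stub/pay/{payment_hash}", None)
                paid.add(payment_hash)
        # Any other state: keep polling; the server advances state on each GET.
    raise TimeoutError(f"job {job_id} did not finish within {max_polls} polls")
```

A real client would sleep between polls; that is left out here so the control flow stays visible.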

Mode 2 — Session (tests 11-16)

| # | Name | What it checks |
|---|------|----------------|
| 11 | Create session | POST /api/sessions {"amount_sats":200} → HTTP 201, sessionId, state=awaiting_payment, invoice.amountSats=200 |
| 12 | Poll before payment | GET /api/sessions/:id → state=awaiting_payment before invoice is paid |
| 13 | Pay deposit + activate | Pays deposit via stub, polls GET → state=active, balanceSats=200, macaroon present |
| 14 | Submit request (accepted) | POST /api/sessions/:id/request with valid macaroon → state=complete OR state=rejected, debitedSats>0, balanceRemaining decremented |
| 15 | Request without macaroon | Same endpoint, no Authorization header → HTTP 401 |
| 16 | Topup invoice creation | POST /api/sessions/:id/topup {"amount_sats":500} with macaroon → HTTP 200, topup.paymentRequest present, topup.amountSats=500 |

Nostr identity + trust engine (tests 25-36)

| # | Name | What it checks |
|---|------|----------------|
| 25 | Challenge nonce | POST /api/identity/challenge → HTTP 200, nonce is 64-char hex, expiresAt is ISO in future |
| 26 | Verify: missing event | POST /api/identity/verify {} → HTTP 400, non-empty error |
| 27 | Verify: unknown nonce | POST /api/identity/verify with fake nonce in content → HTTP 401, error contains "Nonce not found" |
| 28 | Me: no token | GET /api/identity/me without header → HTTP 401, error contains "Missing" |
| 29 | Me: invalid token | GET /api/identity/me with X-Nostr-Token: totally.invalid.token → HTTP 401 |
| 30 | Sessions: bogus token | POST /api/sessions with X-Nostr-Token: badtoken → HTTP 401, no sessionId in response |
| 31 | Jobs: bogus token | POST /api/jobs with X-Nostr-Token: badtoken → HTTP 401 |
| 32 | Sessions anonymous tier | POST /api/sessions (no token) → HTTP 201, trust_tier == "anonymous" |
| 33 | Jobs anonymous tier | POST /api/jobs (no token) → HTTP 201, trust_tier == "anonymous" |
| 34 | GET jobs/:id includes tier | GET /api/jobs/:id → HTTP 200, trust_tier non-null (anonymous job → "anonymous") |
| 35 | GET sessions/:id includes tier | GET /api/sessions/:id → HTTP 200, trust_tier == "anonymous" |
| 36 | Full challenge→sign→verify | Inline node script: generate keypair, challenge, sign kind=27235 event, verify → token; GET /identity/me → tier=new, pubkey matches |
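Test 25's shape assertions (64-char hex nonce, ISO expiresAt in the future) are easy to reproduce when checking a response by hand. The following is a sketch, not the testkit's own code: the function name `challenge_is_valid` is hypothetical, and treating a timestamp without an offset as UTC is an assumption.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def challenge_is_valid(nonce: str, expires_at: str,
                       now: Optional[datetime] = None) -> bool:
    """Check the two properties test 25 asserts on a challenge response."""
    now = now or datetime.now(timezone.utc)
    if not re.fullmatch(r"[0-9a-fA-F]{64}", nonce):
        return False
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        expires = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    except ValueError:
        return False
    if expires.tzinfo is None:
        # Assumption: a timestamp without an offset is meant as UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    return expires > now
```

Passing `now` explicitly keeps the check deterministic, which is convenient when replaying a saved response.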

Timmy economic peer — identity card + vouching guards (tests 37-39)

| # | Name | What it checks |
|---|------|----------------|
| 37 | Timmy identity card | GET /api/identity/timmy → HTTP 200, npub starts with npub1, pubkeyHex is 64 hex chars, zapCount is a non-negative integer |
| 38 | Vouch: no token | POST /api/identity/vouch without X-Nostr-Token → HTTP 401 |
| 39 | Vouch: invalid token | POST /api/identity/vouch with X-Nostr-Token: totally.invalid.token → HTTP 401 |
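For reviewers who want to check a GET /api/identity/timmy response manually, the three assertions in test 37 can be written as one function. This is an illustrative sketch only: `identity_card_problems` is a hypothetical helper, and only the three fields named in the table are checked.

```python
import re
from typing import Any, Dict, List


def identity_card_problems(card: Dict[str, Any]) -> List[str]:
    """Return a list of shape violations; an empty list means the card passes."""
    problems: List[str] = []

    npub = card.get("npub")
    if not (isinstance(npub, str) and npub.startswith("npub1")):
        problems.append("npub must be a string starting with npub1")

    pubkey = card.get("pubkeyHex")
    if not (isinstance(pubkey, str) and re.fullmatch(r"[0-9a-fA-F]{64}", pubkey)):
        problems.append("pubkeyHex must be 64 hex characters")

    zaps = card.get("zapCount")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not (isinstance(zaps, int) and not isinstance(zaps, bool) and zaps >= 0):
        problems.append("zapCount must be a non-negative integer")

    return problems
```

Returning the list of problems rather than a single boolean makes a failing report say which field was wrong.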

Architecture notes for reviewers

Mode 1 mechanics

  • Stub mode is active (no real Lightning node). paymentHash is exposed on GET responses so the testkit can drive the full payment flow automatically. In production (real LNbits), paymentHash is hidden.
  • POST /api/dev/stub/pay/:hash is only mounted when NODE_ENV !== 'production'.
  • State machine advances server-side on every GET poll — no webhooks.
  • AI models: Haiku for eval (cheap gating), Sonnet for work (full output).
  • Pricing: eval = 10 sats fixed. Work invoice = actual token usage (input + output) × Anthropic per-token rate × 1.4 margin, converted at live BTC/USD. This is dynamic — a 53-char request typically produces an invoice of ~180 sats, not a fixed tier. The old 50/100/250 sat fixed tiers were replaced by this model.
  • Max request length: 500 chars. Rate limiter: 5 req/hr/IP on /api/demo (in-memory, resets on server restart).
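The work-invoice pricing rule above can be written out as arithmetic. The sketch below is a hedged illustration: the function name, the split into separate input and output rates, and the round-up-with-a-1-sat-floor rule are all assumptions, and callers must supply their own per-token rates and BTC/USD price. No real Anthropic prices are encoded here.

```python
import math

SATS_PER_BTC = 100_000_000
MARGIN = 1.4  # margin multiplier stated in the pricing note


def work_invoice_sats(input_tokens: int, output_tokens: int,
                      usd_per_input_token: float, usd_per_output_token: float,
                      btc_usd: float) -> int:
    """Token cost in USD, times margin, converted to sats at the given BTC/USD."""
    if btc_usd <= 0:
        raise ValueError("btc_usd must be positive")
    cost_usd = (input_tokens * usd_per_input_token
                + output_tokens * usd_per_output_token)
    priced_usd = cost_usd * MARGIN
    sats = priced_usd / btc_usd * SATS_PER_BTC
    # Assumption: round up to whole sats and never invoice less than 1 sat.
    return max(1, math.ceil(sats))
```

Because the BTC/USD price is an input, the same token usage yields a different invoice as the exchange rate moves, which is why the plan describes the amount as dynamic rather than tiered.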

Mode 2 mechanics

  • Minimum deposit: 100 sats. Maximum: 10,000 sats. Minimum working balance: 50 sats.
  • Session expiry: 24 hours of inactivity. Balance is forfeited on expiry. Expiry is stated in the expiresAt field of every session response.
  • Auth: Authorization: Bearer <macaroon> header. Macaroon is issued on first activation (GET /sessions/:id after deposit is paid).
  • Cost per request: (eval tokens + work tokens) × model rate × 1.4 margin → converted to sats. If a request starts with enough balance but actual cost pushes balance negative, the request still completes and delivers — only the next request is blocked.
  • If balance drops below 50 sats, session transitions to paused. Top up via POST /sessions/:id/topup. Session resumes automatically on the next GET poll once the topup invoice is paid.
  • The same POST /api/dev/stub/pay/:hash endpoint works for all invoice types (eval, work, session deposit, topup).
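The balance rules in this list (debit after delivery, pause below 50 sats, resume after a paid topup) amount to a small state machine. Here is a minimal sketch under stated assumptions: the `Session` dataclass and function names are hypothetical, and resuming only when the balance is back at or above 50 sats is an assumption, since the plan says only that the session resumes once the topup is paid.

```python
from dataclasses import dataclass

MIN_WORKING_BALANCE_SATS = 50  # minimum working balance from the list above


@dataclass
class Session:
    balance_sats: int
    state: str = "active"  # "active" or "paused" in this simplified model


def can_start_request(session: Session) -> bool:
    """A request is accepted only while the session is active and funded."""
    return (session.state == "active"
            and session.balance_sats >= MIN_WORKING_BALANCE_SATS)


def debit(session: Session, cost_sats: int) -> None:
    """Apply the actual cost after delivery; the balance may go negative."""
    session.balance_sats -= cost_sats
    if session.balance_sats < MIN_WORKING_BALANCE_SATS:
        session.state = "paused"


def apply_paid_topup(session: Session, amount_sats: int) -> None:
    """Credit a paid topup and resume if the balance is workable again."""
    session.balance_sats += amount_sats
    # Assumption: resuming requires at least the minimum working balance.
    if session.balance_sats >= MIN_WORKING_BALANCE_SATS:
        session.state = "active"
```

Note that `debit` never refuses: a request that was admitted with enough balance is always delivered, and only the following request is blocked by `can_start_request`.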

Eval + work latency (important for manual testers)

The eval call uses the real Anthropic API (Haiku), typically 2-5 seconds. The testkit uses polling loops (max 30s). Manual testers should poll with similar patience. The work call (Sonnet) typically runs 3-8 seconds.
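For manual testers scripting their own checks, a deadline-bounded polling helper mirrors the 30-second loops described here. This is a generic sketch, not the testkit's implementation: `poll_until` and its parameters are hypothetical, and the one-second interval is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(fetch: Callable[[], T], done: Callable[[T], bool],
               timeout_s: float = 30.0, interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until done(value) is true or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        value = fetch()
        if done(value):
            return value
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        sleep(interval_s)
```

The clock and sleep functions are parameters so the helper can be exercised without real waiting; in normal use the defaults apply and `fetch` would wrap a GET on the job or session URL.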


Test results log

| Date | Tester | Score | Notes |
|------|--------|-------|-------|
| 2026-03-18 | Perplexity Computer | 20/20 PASS | Issue #22 |
| 2026-03-18 | Hermes (Claude Opus 4) | 19/20 (pre-fix) | Issue #23; 1 failure = test ordering bug (8c hit rate limiter before param check). Fixed in testkit v4. |
| 2026-03-19 | Replit Agent (post-fix) | 20/20 PASS | Verified on fresh server after testkit v4 — all fixes confirmed |
| 2026-03-18 | Claude Opus 4.6 | 14/20 (pre-fix) | Issue #25; 2 failures = same rate-limit ordering as Hermes. Fixed in testkit v4. |

Report template

Tester: [Claude / Perplexity / Kimi / Hermes / Human / Other]
Date:
Base URL tested:
Method: [Automated (curl … | bash) / Manual]

Mode 1 — Per-Job

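| Test | Pass / Fail / Skip | Latency | Notes |
|------|--------------------|---------|-------|
| 1 — Health check | | | |
| 2 — Create job | | | |
| 3 — Poll before payment | | | |
| 4 — Pay eval invoice | | | |
| 5 — Eval state advance | | ___s | |
| 6 — Pay work + result | | ___s | |
| 7 — Demo endpoint | | ___s | |
| 8a — Missing body | | | |
| 8b — Unknown job ID | | | |
| 8c — Demo missing param | | | |
| 8d — 501-char request | | | |
| 9 — Rate limiter | | | |
| 10 — Rejection path | | ___s | |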

Mode 2 — Session

| Test | Pass / Fail / Skip | Notes |
|------|--------------------|-------|
| 11 — Create session | | |
| 12 — Poll before payment | | |
| 13 — Pay + activate | | |
| 14 — Submit request | | |
| 15 — Reject no macaroon | | |
| 16 — Topup invoice | | |

Overall verdict: Pass / Partial / Fail

Total: PASS=___ FAIL=___ SKIP=___

Issues found:

Observations on result quality:

Suggestions: