Timmy API Testkit Report — Claude Opus 4.6 — 14/20 PASS (2 rate-limit false failures, pre-fix) #25

Closed
opened 2026-03-19 01:03:20 +00:00 by replit · 2 comments
Owner

Timmy API Testkit Report — Claude (Opus 4.6)

Tester: Claude (Opus 4.6) via browser automation
Date: 2026-03-18
Base URL tested: https://9f85e954-647c-46a5-90a7-396e495a805a-00-clz2vhmfuk7p.spock.replit.dev
Method: Manual (browser automation)
Testkit version: v3 (pre-fix — tested before PR #24 was merged)


Mode 1 — Per-Job

| # | Test | Pass / Fail | Latency | Notes |
|---|------|-------------|---------|-------|
| 1 | Health check | PASS | - | HTTP 200, status=ok, uptime 776s, 49 jobs total |
| 2 | Create job | PASS | - | HTTP 201, jobId returned, evalInvoice.amountSats=10 |
| 3a | Poll (state) | PASS | - | state=awaiting_eval_payment |
| 3b | Poll (paymentHash) | PASS | - | paymentHash present, stub mode active |
| 4 | Pay eval invoice | PASS | - | ok: true |
| 5 | Eval state advance | PASS | 3s | state=awaiting_work_payment, workInvoice.amountSats=182 |
| 6 | Pay work + result | PASS | 5s | state=complete, coherent LN explanation |
| 7 | Demo endpoint | FAIL | <1s | HTTP 429: rate limit exhausted by prior runs |
| 8a | Missing body | PASS | - | HTTP 400 |
| 8b | Unknown job ID | PASS | - | HTTP 404 |
| 8c | Demo missing param | FAIL | - | HTTP 429 instead of 400; rate limiter fires before param validation |
| 8d | 501-char request | PASS | - | HTTP 400, mentions 500 chars |
| 9 | Rate limiter | PASS | - | All 6 requests returned 429 |
| 10 | Rejection path | PASS | 2s | state=rejected, reason cites ethical/legal guidelines |

Mode 2 — Session

| # | Test | Pass / Fail | Notes |
|---|------|-------------|-------|
| 11 | Create session | PASS | HTTP 201, state=awaiting_payment, amountSats=200 |
| 12 | Poll before payment | PASS | state=awaiting_payment |
| 13 | Pay + activate | PASS | state=active, balanceSats=200, macaroon present |
| 14 | Submit request | PASS | state=complete, debitedSats=179, balanceRemaining=21, latency 2s |
| 15 | Reject no macaroon | PASS | HTTP 401 |
| 16 | Topup invoice | PASS | paymentRequest present, amountSats=500 |

Overall verdict: PASS (failures are test-ordering artifacts, not API bugs)

Total: PASS=14 FAIL=2 SKIP=0


Issues Found

  1. Tests 7 and 8c: both fail because the rate-limit quota was already exhausted by prior runs, the same root cause as Hermes #23. Already fixed in PR #24 (testkit v4: test 8c moved before tests 7 and 9; also replaces head -n-1 with sed '$d' for macOS compat).

  2. "Stop Claude" button in healthz response — Claude flagged this as a possible prompt injection. It is a non-issue: this is the Anthropic web UI's own Stop button rendered in the browser tab, not anything served by our API. GET /api/healthz returns plain JSON.
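
The ordering problem in issue 1 can be illustrated with a small simulation. This is only a sketch: `demo_request` is a hypothetical stand-in for the demo endpoint, and the quota of 5 is an assumed value (the report only shows that the limiter answers 429 once exhausted and that it runs before parameter validation).

```shell
#!/bin/sh
# Hypothetical stand-in for the demo endpoint (not the real API).
# Assumption: quota of 5 requests, rate limit checked BEFORE validation.
LIMIT=5
used=0
status=

demo_request() {
  # $1 = request parameter; empty means "missing param"
  used=$((used + 1))
  if [ "$used" -gt "$LIMIT" ]; then
    status=429            # quota exhausted: validation never runs
  elif [ -z "$1" ]; then
    status=400            # validation error for a missing param
  else
    status=200
  fi
}

# v3-style order: earlier requests consume the quota, then test 8c runs.
used=0
i=0
while [ "$i" -lt "$LIMIT" ]; do
  demo_request "hello"
  i=$((i + 1))
done
demo_request ""
late_status=$status

# v4-style order: test 8c runs first, while quota is still available.
used=0
demo_request ""
early_status=$status

echo "8c after quota exhausted: $late_status"
echo "8c run first: $early_status"
```

Under these assumptions the late run reports 429 and the early run reports 400, which matches the false failure seen in test 8c and the effect of the reordering.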


Observations on Result Quality

  • Test 6 (LN explanation): accurate, correctly describes Layer 2 and payment channels.
  • Test 10 (adversarial): correctly rejected with clear ethical/legal reasoning.
  • Test 14 (session request): 179 sat debit for a short request — token-based pricing working as documented.
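
The session figures from tests 13 and 14 are internally consistent, as a quick arithmetic check shows. The variable names below are illustrative; the values are the ones reported in the Mode 2 results.

```shell
#!/bin/sh
balance_sats=200   # balanceSats after activation (test 13)
debited_sats=179   # debitedSats for the request (test 14)

remaining=$((balance_sats - debited_sats))
echo "balanceRemaining=$remaining"   # prints balanceRemaining=21
```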

Suggestions

  • None beyond what Perplexity and Hermes already noted. Rate-limit ordering fix covers both failures.
claude was assigned by Rockachopa 2026-03-22 23:38:26 +00:00
Collaborator

PR #54 created.

Changes:

  1. Reordered testkit: moved test 8c (demo param validation) before tests 7 and 9 so rate-limit quota is not consumed before validation is checked — eliminates both false failures
  2. Replaced all 19 head -n-1 calls with sed '$d' for macOS compatibility
  3. Archived the Claude Opus 4.6 testkit report to reports/timmy-testkit-claude-opus.md
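
For context on change 2: GNU head accepts a negative line count (all lines except the last), but the BSD head shipped with macOS rejects it, whereas sed '$d' deletes the last line on both. The sketch below assumes the testkit captures the HTTP status by appending it as a final line (for example with curl -s -w '\n%{http_code}'); that capture pattern is an assumption, and the response here is simulated with printf rather than fetched.

```shell
#!/bin/sh
# Simulated output of a request such as:
#   curl -s -w '\n%{http_code}' "$BASE/api/healthz"
# (the capture pattern is an assumption about the testkit)
response=$(printf '%s\n%s' '{"status":"ok"}' '200')

# Portable: sed '$d' drops the last line on GNU and BSD/macOS alike.
body=$(printf '%s\n' "$response" | sed '$d')

# The status code is the last line.
code=$(printf '%s\n' "$response" | tail -n 1)

echo "body=$body"
echo "code=$code"
```

The single quotes around `$d` matter: unquoted, the shell would expand `$d` as a variable before sed ever sees it.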
Collaborator

PR #54 rebased on latest main and updated.

Summary of changes:

  1. Reordered testkit: test 8c (demo param validation) now runs before tests 7 and 9, preventing rate-limit false failures
  2. Replaced all head -n-1 calls with sed '$d' for macOS compatibility
  3. Archived Claude Opus 4.6 testkit report to reports/timmy-testkit-claude-opus.md

Shell syntax validated, no conflicts with main.
