Timmy API Testkit Report — Claude Opus 4.6 — 14/20 PASS (2 rate-limit false failures, pre-fix) #25


replit · 2026-03-19T01:03:20Z

replit commented

2026-03-19 01:03:20 +00:00

Timmy API Testkit Report — Claude (Opus 4.6)

Tester: Claude (Opus 4.6) via browser automation
Date: 2026-03-18
Base URL tested: https://9f85e954-647c-46a5-90a7-396e495a805a-00-clz2vhmfuk7p.spock.replit.dev
Method: Manual (browser automation)
Testkit version: v3 (pre-fix — tested before PR #24 was merged)

Mode 1 — Per-Job

Test	Pass / Fail	Latency	Notes
1 — Health check	PASS	—	HTTP 200, status=ok, uptime 776s, 49 jobs total
2 — Create job	PASS	—	HTTP 201, jobId returned, evalInvoice.amountSats=10
3a — Poll (state)	PASS	—	state=awaiting_eval_payment
3b — Poll (paymentHash)	PASS	—	paymentHash present, stub mode active
4 — Pay eval invoice	PASS	—	ok: true
5 — Eval state advance	PASS	3s	state=awaiting_work_payment, workInvoice.amountSats=182
6 — Pay work + result	PASS	5s	state=complete, coherent LN explanation
7 — Demo endpoint	FAIL	<1s	HTTP 429 — rate-limit exhausted by prior runs
8a — Missing body	PASS	—	HTTP 400
8b — Unknown job ID	PASS	—	HTTP 404
8c — Demo missing param	FAIL	—	HTTP 429 instead of 400 — rate limiter fires before param validation
8d — 501-char request	PASS	—	HTTP 400, mentions 500 chars
9 — Rate limiter	PASS	—	All 6 requests returned 429
10 — Rejection path	PASS	2s	state=rejected, reason cites ethical/legal guidelines

Mode 2 — Session

Test	Pass / Fail	Notes
11 — Create session	PASS	HTTP 201, state=awaiting_payment, amountSats=200
12 — Poll before payment	PASS	state=awaiting_payment
13 — Pay + activate	PASS	state=active, balanceSats=200, macaroon present
14 — Submit request	PASS	state=complete, debitedSats=179, balanceRemaining=21, latency 2s
15 — Reject no macaroon	PASS	HTTP 401
16 — Topup invoice	PASS	paymentRequest present, amountSats=500

Overall verdict: PASS (failures are test-ordering artifacts, not API bugs)

Total: PASS=14 FAIL=2 SKIP=0

Issues Found

Tests 7 and 8c: both fail because prior runs had already exhausted the rate limit. This is the same root cause as Hermes #23 and is already fixed in PR #24 (testkit v4: test 8c moved before tests 7 and 9; it also replaces head -n-1 with sed '$d' for macOS compat).
"Stop Claude" button in healthz response — Claude flagged this as a possible prompt injection. It is a non-issue: this is the Anthropic web UI's own Stop button rendered in the browser tab, not anything served by our API. GET /api/healthz returns plain JSON.

Observations on Result Quality

Test 6 (LN explanation): accurate, correctly describes Layer 2 and payment channels.
Test 10 (adversarial): correctly rejected with clear ethical/legal reasoning.
Test 14 (session request): 179 sat debit for a short request — token-based pricing working as documented.
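
The session figures reported in tests 13 and 14 can be cross-checked with simple arithmetic. The sketch below only re-uses the numbers from the tables above; the variable names are illustrative and are not taken from the testkit.

```shell
#!/bin/sh
# Values copied from the report (tests 13 and 14); names are hypothetical.
balance_sats=200        # balanceSats after activation
debited_sats=179        # debitedSats for the submitted request
reported_remaining=21   # balanceRemaining returned by the API

# The remaining balance should equal the starting balance minus the debit.
expected_remaining=$((balance_sats - debited_sats))

if [ "$expected_remaining" -eq "$reported_remaining" ]; then
  ledger=consistent
else
  ledger=mismatch
fi

echo "expected=$expected_remaining reported=$reported_remaining ledger=$ledger"
```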

Suggestions

None beyond what Perplexity and Hermes already noted. Rate-limit ordering fix covers both failures.


replit referenced this issue from a commit

2026-03-19 01:05:06 +00:00

docs: add Claude Opus 4.6 result to testkit results log (issue #25)

replit referenced this issue

2026-03-19 01:30:24 +00:00

feat: streaming, rate limiting, CORS, event bus, SSE — closes #2 #3 #4 #5; fixes #14 #16 #20

replit referenced this issue from a commit

2026-03-19 18:40:28 +00:00

feat: real LNbits mode support — 29/29 testkit PASS

replit referenced this issue from a commit

2026-03-19 18:40:28 +00:00

feat(task-25): real LNbits mode on Hermes VPS — 29/29 testkit PASS

replit referenced this issue from a commit

2026-03-19 18:40:28 +00:00

feat(task-25): LNbits on Hermes VPS — real-mode wiring, 29/29 PASS

claude was assigned by Rockachopa

2026-03-22 23:38:26 +00:00

claude referenced this issue from a commit

2026-03-23 01:04:24 +00:00

fix: reorder testkit to prevent rate-limit false failures

claude referenced a pull request that will close this issue

2026-03-23 01:05:22 +00:00

[claude] fix testkit rate-limit false failures (#25) #54

claude commented

2026-03-23 01:05:33 +00:00

PR #54 created.

Changes:

Reordered testkit: moved test 8c (demo param validation) before tests 7 and 9 so rate-limit quota is not consumed before validation is checked — eliminates both false failures
Replaced all 19 head -n-1 calls with sed '$d' for macOS compatibility
Archived the Claude Opus 4.6 testkit report to reports/timmy-testkit-claude-opus.md
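
The effect of the reordering can be sketched with a small offline simulation. It assumes, as the note on test 8c in the report says, that the rate limiter runs before parameter validation. The demo_request function, the quota of 5, and the request text are hypothetical stand-ins, not the real API or the actual testkit code.

```shell
#!/bin/sh
# Hypothetical stand-in for the demo endpoint: the limiter is checked first,
# parameter validation second. LIMIT=5 is an assumed quota, not the real one.
LIMIT=5
count=0
status=

demo_request() {
  # $1 is the request parameter; an empty string models a missing param.
  count=$((count + 1))
  if [ "$count" -gt "$LIMIT" ]; then
    status=429          # quota exhausted, validation never runs
  elif [ -z "$1" ]; then
    status=400          # validation error is visible only while quota remains
  else
    status=200
  fi
}

# Old order: quota-consuming requests run first, then the missing-param check.
count=0
i=0
while [ "$i" -lt 6 ]; do
  demo_request "hello"
  i=$((i + 1))
done
demo_request ""
old_order_status=$status

# New order: the missing-param check runs while quota is still available.
count=0
demo_request ""
new_order_status=$status

echo "old order, missing-param check: $old_order_status"
echo "new order, missing-param check: $new_order_status"
```

Under these assumptions the old order reports 429 for the missing-param check and the new order reports 400, which mirrors the false failure seen in test 8c.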


claude referenced this issue from a commit

2026-03-23 02:07:49 +00:00

fix: reorder testkit to prevent rate-limit false failures

claude commented

2026-03-23 02:08:22 +00:00

PR #54 rebased on latest main and updated.

Summary of changes:

Reordered testkit: test 8c (demo param validation) now runs before tests 7 and 9, preventing rate-limit false failures
Replaced all head -n-1 with sed '$d' for macOS compatibility
Archived Claude Opus 4.6 testkit report to reports/timmy-testkit-claude-opus.md

Shell syntax validated, no conflicts with main.
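
As background on the portability change: BSD head on macOS rejects negative line counts, while sed '$d' (delete the last line) behaves the same with GNU and BSD sed. The sketch below assumes the testkit appends the HTTP status code as the last line of each response, for example via curl's -w option. That usage is an assumption, and the response here is simulated with printf so the snippet runs offline.

```shell
#!/bin/sh
# Simulated response in the assumed "body, then status code on the last line"
# layout. The JSON body is made up for illustration.
response=$(printf '%s\n%s' '{"status":"ok"}' '200')

# Portable way to drop the last line: sed '$d'.
# The single quotes matter: unquoted, the shell would expand $d as a variable.
body=$(printf '%s\n' "$response" | sed '$d')

# The status code is the last line.
code=$(printf '%s\n' "$response" | tail -n 1)

echo "body=$body"
echo "code=$code"
```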


Rockachopa closed this issue

2026-03-23 14:51:32 +00:00

Rockachopa referenced this issue from a commit

2026-03-23 14:51:33 +00:00

[claude] fix testkit rate-limit false failures (#25) (#54)


Reference: replit/token-gated-economy#25