# "Meet and Partnership" Protocol Implementation Report

## 1. Discovery: NIP-89 Capability Card Monitoring

### 1.1 Protocol Overview

#### 1.1.1 NIP-89 Specification for Application Handlers

The **NIP-89 protocol** establishes a decentralized mechanism for service discovery in the Nostr ecosystem, eliminating the need for centralized registries or manual configuration [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md). At its core, NIP-89 defines **Kind 31990 events**—"Capability Cards"—that platforms publish to advertise their supported task types, operational parameters, and preferred communication channels. This self-sovereign advertisement model aligns with Nostr's foundational principles: cryptographic identity binding, relay-based propagation, and permissionless participation.

The specification introduces two complementary event kinds: **Kind 31989** for user recommendations of preferred handlers, and **Kind 31990** for platform self-announcement [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md). For autonomous agent operation, Kind 31990 serves as the primary discovery vector, enabling continuous passive monitoring for new computational service providers. The protocol's replaceable event semantics (per NIP-33) ensure that platforms can update capabilities without historical accumulation—relays retain only the most recent event for each **pubkey + "d" tag** combination, creating a naturally fresh discovery stream.

The architectural significance extends beyond simple advertisement. By mandating structured tag conventions and JSON-encoded content, NIP-89 enables **machine-parseable capability descriptions** that support automated decision-making. The cryptographic binding to platform npubs creates **non-repudiable provenance** for all capability claims, establishing the trust foundation for subsequent partnership interactions [(Github)](https://github.com/nostr-jp/nips-ja).
#### 1.1.2 Role of Kind 31990 Events in Platform Discovery

Kind 31990 events function as **self-describing service contracts** that enable our agent to build and maintain a dynamic registry of computational partners. Each event carries three critical information layers: **identity** (the platform's npub), **capability** (supported task kinds via "k" tags), and **reachability** (preferred relays for communication) [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md). This tripartite structure supports complete partnership initiation without external coordination.

The discovery process operates as a **continuous background activity**: our agent maintains persistent WebSocket subscriptions to curated relays, filtering for Kind 31990 events with benchmark-relevant capabilities. Incoming events undergo multi-stage validation—signature verification, temporal freshness checks, and semantic parsing—before registry insertion. This pipeline ensures that only **verifiable, current, and relevant** platforms enter the candidate pool.

The "Capability Card" terminology reflects the event's function as a **machine-readable business card** [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/). Platforms compete for agent attention through transparent capability declarations, creating market dynamics that reward quality and reliability. Our agent's discovery subsystem must therefore implement sophisticated filtering and ranking to identify partnership candidates worth the investment of benchmark evaluation.

#### 1.1.3 Relationship Between NIP-89 and NIP-90 Discovery Flow

The **NIP-89/NIP-90 handshake** creates a clean separation between discovery and execution that enables modular, evolvable system design [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
NIP-89 answers "who can do what" through passive capability advertisement; NIP-90 answers "how do I request it" through active job delegation. This separation allows each protocol to evolve independently—new task kinds can be added to NIP-90 without NIP-89 modification, and discovery mechanisms can be enhanced without disrupting job execution semantics.

The operational flow follows a **two-phase pattern**: (1) **continuous discovery** via Kind 31990 subscription, building and refreshing the candidate registry; (2) **selective engagement** via Kind 5000+ job requests, transforming discovered potential into verified partnerships. The interface between phases is the **platform npub and relay hints** extracted from capability cards, which directly inform job request addressing and routing [(Github)](https://github.com/nostr-jp/nips-ja).

This architectural clarity supports **fault-tolerant, scalable operation**. Discovery failures (missed capability cards) are recoverable through periodic re-subscription; execution failures (job timeouts) inform trust adjustment without invalidating discovery information. The agent maintains separation between the **discovery registry** (updated via NIP-89) and the **routing matrix** (updated via execution outcomes), with the latter referencing but not duplicating the former's identity data [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

### 1.2 Relay Subscription Implementation

#### 1.2.1 Filter Configuration for Kind 31990 Events

Effective relay subscription requires **precision filtering** that captures all relevant capability advertisements without overwhelming processing capacity.
The base filter specification targets Kind 31990 exclusively, with strategic extensions for operational efficiency:

| Filter Parameter | Value | Purpose |
|------------------|-------|---------|
| `kinds` | `[31990]` | Isolate capability advertisements |
| `since` | `last_processed_timestamp` | Avoid reprocessing on reconnection |
| `#k` | `["5000", "benchmark-task"]` (optional) | Pre-filter for benchmark-relevant platforms |
| `limit` | `1000` | Prevent initial sync flooding |

The **time-based filtering** using `since` parameters is critical for operational continuity—our agent persists the highest `created_at` timestamp across all processed events, using this watermark to request only fresh advertisements on subscription re-establishment [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md). For cold-start scenarios, a **24-hour lookback window** captures recently active platforms without excessive historical backfill.

**Tag-based pre-filtering** via `#k` reduces client-side processing but risks missing platforms using non-standard capability identifiers. Our implementation employs **hybrid filtering**: broad subscription for comprehensive discovery, with client-side semantic matching for "benchmark-task" equivalents. This approach maximizes coverage while maintaining processing efficiency.
Relay selection follows a **tiered pool strategy**:

| Tier | Characteristics | Subscription Mode |
|------|-----------------|-------------------|
| Tier 1 | High uptime, low latency, demonstrated NIP-90 activity | Persistent WebSocket |
| Tier 2 | Moderate activity, geographic diversity | Periodic polling (5-minute intervals) |
| Tier 3 | Long tail, emerging platforms | Stochastic sampling (hourly) |

This tiered approach ensures **comprehensive coverage** without prohibitive resource consumption, with dynamic promotion/demotion based on observed event yield and connection stability [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

#### 1.2.2 Tag Parsing: Extracting Supported Task Kinds from "k" Tags

The **"k" tag** serves as the primary capability discriminator, with each tag value indicating a supported NIP-90 job kind [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md). Parsing must handle **multiple conventions**: numeric kind identifiers (5000-5999), semantic strings ("benchmark-task"), and comma-separated compound values. Our normalizer expands these into an internal **capability set** per platform, enabling efficient task-kind matching during delegation.

| "k" Tag Value | Interpretation | Action |
|---------------|----------------|--------|
| `"5000"` | Generic job request | Include in benchmark candidate pool |
| `"benchmark-task"` | Explicit benchmark capability | Include with high priority |
| `"5000,5100,5200"` | Compound declaration | Expand to individual kinds |
| `"compute-*"` | Pattern-matched capability | Evaluate against known patterns |

**Cross-reference validation** between "k" tags and content field `supported_kinds` arrays detects inconsistencies that may indicate partial implementation or configuration errors.
Our agent applies **conservative reconciliation**: union for inclusion (advertisement in either location qualifies), intersection for exclusion (confirmation in both locations is required to trust), with discrepancy logging for manual review [(Github)](https://github.com/nostr-jp/nips-ja).

#### 1.2.3 Platform Identifier Extraction from "d" Tags

The **"d" tag** provides platform-defined unique identifiers within the context of their npub, enabling **multi-service discrimination**—a single entity may operate distinct services with separate capability profiles [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md). Our agent treats **pubkey + "d" tag** as the composite primary key for registry entries, supporting:

- **Deduplication**: Multiple capability cards from the same platform merge correctly
- **Version tracking**: Capability evolution captured as sequential updates
- **Service differentiation**: Independent trust scoring for a platform's different offerings

Identifier normalization handles **UTF-8 variability** (case, spacing, special characters) by preserving original values for display while generating **canonical forms** for internal indexing. Collision detection identifies normalized conflicts across different npubs, with disambiguation via npub prefixing in internal identifiers.

#### 1.2.4 Relay Hint Processing for Response Routing

The **"relays" tag** carries platform-preferred WebSocket URLs for job request delivery [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md).
Processing involves **validation, prioritization, and integration**:

| Processing Stage | Action | Output |
|------------------|--------|--------|
| Validation | URL format check, connectivity probe | Filtered, reachable relay list |
| Prioritization | Order by advertised preference, observed latency, topological proximity | Ranked relay candidates |
| Integration | Merge with agent's default pool, annotate with platform association | Platform-specific routing table |

**Respect for platform preferences** improves delivery success—advertised relays reflect operational infrastructure where platforms maintain reliable event monitoring. However, **fallback flexibility** prevents partition failures: if all preferred relays are unreachable, the agent attempts delivery through its general pool with logging for potential platform feedback [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

### 1.3 Capability Card Event Structure

#### 1.3.1 Required Fields: kind, pubkey, created_at, tags, content, sig

Kind 31990 events conform to **NIP-01 base structure** with NIP-89-specific conventions [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md):

| Field | Specification | Validation |
|-------|---------------|------------|
| `kind` | Must equal **31990** | Hard rejection on mismatch |
| `pubkey` | 32-byte hex, platform identity | Signature verification against this key |
| `created_at` | Unix timestamp | ±1 hour tolerance for freshness |
| `tags` | Array containing at minimum "d" and "k" | Structural validation, semantic parsing |
| `content` | JSON-encoded metadata | Schema-flexible parsing with graceful degradation |
| `sig` | 64-byte Schnorr signature | Cryptographic verification mandatory |

**Signature verification is non-negotiable**—unverified events are discarded entirely to prevent spoofing attacks.
Our implementation uses constant-time verification via secp256k1 bindings, with failed verification events logged forensically for potential attack detection [(Github)](https://github.com/nostr-jp/nips-ja).

#### 1.3.2 Content Field: JSON-Encoded Platform Metadata

The content field carries a **human- and machine-readable platform description** with an emergent de facto schema [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md):

| Field | Type | Purpose | Handling |
|-------|------|---------|----------|
| `name` | String | Human-readable identifier | Display, logging |
| `description` | String | Capability narrative | Evaluation context |
| `version` | String | Software/protocol version | Compatibility assessment |
| `supported_kinds` | Array | Redundant capability enumeration | Cross-validation with "k" tags |
| `pricing` | Object | Cost model hints | Bid optimization |
| `performance` | Object | Latency, throughput claims | Initial trust weighting |

**Defensive parsing** applies: missing fields receive sensible defaults, malformed JSON triggers rejection with error logging, and unknown fields are preserved for forward compatibility. Content size limits (64KB typical) prevent resource exhaustion from oversized advertisements.
#### 1.3.3 Example Event Structure for Benchmark-Capable Platforms

```json
{
  "id": "4376c65d2f232afbe9b882a35baa4f6fe8667c4e684749af565f981833ed6a65",
  "pubkey": "6e468422dfb74a5738702a8823b9b28168abab8655faacb6853cd0ee15deee93",
  "created_at": 1774048752,
  "kind": 31990,
  "tags": [
    ["d", "benchmark-platform-v2"],
    ["k", "5000"],
    ["k", "benchmark-task"],
    ["relays", "wss://relay.benchmark-platform.io"],
    ["relays", "wss://relay.damus.io"]
  ],
  "content": "{\"name\": \"ComputeNode Alpha\", \"capabilities\": [\"benchmark-task\", \"data-processing\"], \"version\": \"2.1.0\", \"pricing\": \"per-request\", \"performance\": {\"typical_latency_ms\": 5000}}",
  "sig": "9378edf9a81b1e0d164c9c69d62a1974ce39ab3f550988c5a584819fdb6b1bc4ed9e0e3de68396226260b6d9e67d4f8bf8640b9099a5265446732b15b4dd5050"
}
```

This example demonstrates **complete capability advertisement**: numeric and semantic kind declarations, redundant relay hints, structured performance metadata, and economic model indication. Our agent's parser extracts all relevant fields for registry population and partnership evaluation [(Github)](https://github.com/nostr-protocol/nips/blob/master/89.md).
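Given a capability card of this shape, extraction of the three information layers (identity, capability, reachability) might look like the following sketch; `extract_registry_fields` is a hypothetical helper, and compound "k" declarations are expanded as described in section 1.2.2:

```python
def extract_registry_fields(event: dict) -> dict:
    """Pull identity, capability, and reachability from a Kind 31990 event."""
    d_tag = ""
    kinds: set[str] = set()
    relays: list[str] = []
    for tag in event.get("tags", []):
        if len(tag) < 2:
            continue  # skip malformed tags
        if tag[0] == "d":
            d_tag = tag[1]
        elif tag[0] == "k":
            # Expand compound declarations like "5000,5100,5200"
            for k in tag[1].split(","):
                kinds.add(k.strip())
        elif tag[0] == "relays":
            relays.extend(tag[1:])
    return {
        "npub": event["pubkey"],          # identity layer
        "d_tag": d_tag,                   # service instance identifier
        "supported_kinds": kinds,         # capability layer
        "relay_hints": relays,            # reachability layer
        "last_seen": event["created_at"],
    }
```

The returned dictionary maps directly onto the registry fields defined in section 1.4.1.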
### 1.4 Agent Discovery Logic

#### 1.4.1 Maintaining Registry of Discovered Platforms

The **platform registry** serves as the authoritative data structure for discovery state, implemented as a **persistent, queryable store** with appropriate indexing:

| Registry Field | Type | Purpose | Update Trigger |
|----------------|------|---------|----------------|
| `npub` | String (hex) | Primary identity key | Initial detection |
| `d_tag` | String | Platform service identifier | Capability card reception |
| `discovered_at` | Timestamp | First observation | Initial insertion |
| `last_seen` | Timestamp | Most recent update | Capability card refresh |
| `supported_kinds` | Set[String] | Normalized capabilities | Tag/content parsing |
| `relay_hints` | List[URL] | Prioritized contact points | "relays" tag extraction |
| `trust_score` | Float [0.0-1.0] | Partnership readiness | Benchmark outcomes |
| `status` | Enum | Lifecycle state | State machine transitions |

**Registry maintenance** implements: merge logic for capability updates (newer `created_at` supersedes), staleness detection (no update for 30 days triggers quarantine), and archival (soft-delete with audit preservation) [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

#### 1.4.2 Filtering for "benchmark-task" Capability Advertisements

**Capability matching** employs exact and fuzzy strategies:

| Match Type | Pattern | Confidence | Action |
|------------|---------|------------|--------|
| Exact | `"benchmark-task"` in "k" tag or `supported_kinds` | 1.0 | Immediate candidate inclusion |
| Fuzzy | `"benchmark"`, `"benchmarking"`, `"perf-test"`, `"compute-benchmark"` | 0.8 | Inclusion with annotation |
| Inferred | Kind 5000 + performance metadata | 0.6 | Secondary candidate pool |

**Configurable synonym mappings** enable rapid adaptation to emerging platform conventions without code deployment.
All matching decisions log confidence scores for pattern refinement [(Github)](https://github.com/nostr-jp/nips-ja).

#### 1.4.3 Storing Platform npub-to-Relay Mappings

The **npub-to-relay mapping** optimizes job request routing through multi-source integration:

| Source | Priority | Reliability |
|--------|----------|-------------|
| Platform-advertised "relays" tags | 1 (highest) | Variable—may be optimistic |
| Observed capability card publication relays | 2 | High—indicates active presence |
| Network topology heuristics (geographic proximity) | 3 | Moderate—latency estimation |

**Dynamic quality tracking** per relay-platform pair informs routing decisions: successful job completions reinforce preferred paths, timeouts or failures trigger alternative exploration. This empirical optimization may diverge from advertised preferences based on observed performance [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md).

---

## 2. The "Meet": NIP-90 Job Request for Benchmark Task

### 2.1 NIP-90 Data Vending Machine Protocol

#### 2.1.1 Job Request Kind Range: 5000-5999

NIP-90 reserves **kinds 5000-5999 for Job Requests**, establishing a typed namespace for computational work solicitation [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/). The range structure enables **semantic routing without content inspection**: specific kind numbers indicate task categories, allowing platforms to filter for relevant requests efficiently.
| Kind Subrange | Purpose | Example Tasks |
|---------------|---------|---------------|
| 5000 | Generic job request | Uncategorized, benchmark tasks |
| 5100-5199 | Text generation | Completion, summarization, translation |
| 5200-5299 | Image generation | Creation, editing, style transfer |
| 5300-5399 | Audio processing | Transcription, synthesis, enhancement |
| 5900-5999 | Reserved for private use | Consortium-specific extensions |

Our benchmark implementation uses **Kind 5000** as the universal entry point, maximizing platform compatibility. Specialized kind selection applies only when platforms explicitly advertise support and task characteristics clearly match the specialized category [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

#### 2.1.2 Job Result Kind Range: 6000-6999 (Request Kind + 1000)

The **+1000 offset convention** creates deterministic request-result correlation: Kind 5000 requests expect Kind 6000 results, Kind 5100 requests expect Kind 6100 results, etc. [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/). This mathematical relationship enables:

- **Efficient subscription filtering**: Agents compute expected result kinds from outstanding requests
- **Relay optimization**: Kind-based routing without deep content inspection
- **Simplified client implementation**: No explicit request-result mapping required

Result kind validation is mandatory—**kind mismatches indicate protocol non-compliance** and trigger logging for platform quality assessment.
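The +1000 offset and the mandatory mismatch check reduce to two small functions; a sketch with hypothetical helper names:

```python
def expected_result_kind(request_kind: int) -> int:
    """NIP-90 result kinds are the request kind plus 1000 (5000 -> 6000)."""
    if not 5000 <= request_kind <= 5999:
        raise ValueError(f"not a NIP-90 job request kind: {request_kind}")
    return request_kind + 1000


def is_compliant_result(request_kind: int, result_kind: int) -> bool:
    """False indicates protocol non-compliance, to be logged for quality scoring."""
    return result_kind == expected_result_kind(request_kind)
```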
#### 2.1.3 Feedback Kind 7000 for Intermediate Status

**Kind 7000 events** provide asynchronous status updates during job execution [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/):

| Status Value | Meaning | Agent Response |
|--------------|---------|----------------|
| `payment-required` | Platform awaits compensation | Evaluate bid against budget, trigger payment flow if acceptable |
| `processing` | Job accepted, execution begun | Reset timeout timer, update internal state |
| `partial` | Incremental result available | Stream to consumer if applicable, continue monitoring |
| `error` | Execution failure with explanation | Log details, trigger retry or fallback |
| `success` | Completion confirmed, result forthcoming | Prepare result processing pipeline |

Feedback correlation uses the **`e` tag referencing the original request ID**, enabling unambiguous association even with multiple concurrent requests to the same platform [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

### 2.2 Benchmark Task Job Request Construction

#### 2.2.1 Kind 5000 Selection for Generic Job Requests

**Kind 5000** serves as the **default, maximally-compatible job request kind**. Selection signals that the request follows generic NIP-90 semantics without task-specific extensions. This choice prioritizes **platform reach over optimization**—specialized kinds may offer efficiency gains but reduce the candidate pool.

The generic kind places greater descriptive burden on **tag construction**, which must fully specify benchmark requirements without kind-specific conventions. Our implementation compensates through **rich parameterization** via "param" tags and structured input via "i" tags [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
#### 2.2.2 "i" Tag: Benchmark Input Data Specification

The **"i" tag** carries primary input data with the format `["i", "<data>", "<type>", "<marker>"]` [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/):

| Component | Purpose | Examples |
|-----------|---------|----------|
| `data` | Payload or reference | Inline JSON, URL, content hash (NIP-94) |
| `type` | Encoding indicator | `"json"`, `"text"`, `"base64"`, `"url"`, `"event"` |
| `marker` | Multi-input discrimination | `"benchmark_spec"`, `"test_data"`, `"reference_impl"` |

**Size management** is critical—relays typically enforce ~64KB event limits. Large inputs use **NIP-94 blob references**: `["i", "https://agent.io/benchmarks/v3/spec.json", "url", "benchmark_spec"]` [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

#### 2.2.3 "param" Tag: Benchmark Type and Parameters

**"param" tags** enable structured, extensible configuration via `["param", "<key>", "<value>"]` [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/):

| Parameter Key | Typical Value | Purpose |
|---------------|---------------|---------|
| `benchmark-type` | `"standard"`, `"stress"`, `"correctness"` | Suite selection |
| `time-limit` | `"300"` | Execution timeout (seconds) |
| `memory-limit` | `"512"` | RAM constraint (MB) |
| `iterations` | `"5"` | Statistical reliability |
| `language` | `"python3.11"` | Execution environment |

Parameter **versioning** ensures evolution without breakage: `benchmark-type-v2` introduces new test modules while `benchmark-type` maintains backward compatibility.
#### 2.2.4 "output" Tag: Expected Return Format (JSON)

Explicit output specification via `["output", "json"]` or `["output", "json", "<schema>"]` enables **automated validation** [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/). JSON structure expectations are documented in agent-platform integration guides, with graceful degradation for platforms producing valid but unexpected field sets.

#### 2.2.5 "bid" Tag: Optional Payment Ceiling

The **"bid" tag** `["bid", "<msat-amount>"]` establishes the **maximum acceptable cost** [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/). For benchmark tasks, our initial implementation uses **nominal bids** (e.g., 1000 millisatoshis) to signal serious intent without actual economic settlement, preserving budget for production partnerships. Future integration with Lightning wallets enables a seamless transition to paid evaluation.

#### 2.2.6 "relays" Tag: Response Delivery Hints

**Bidirectional relay negotiation** combines platform preferences with agent infrastructure: `["relays", "<url-1>", "<url-2>", "<url-3>"]` [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md). This tag guides platform result publication toward relays where our agent maintains active, reliable subscriptions.

### 2.3 Targeting Discovered Platforms

#### 2.3.1 Addressing Job Request to Specific Platform npub

**Explicit targeting** via `["p", "<platform-pubkey>"]` ensures unambiguous recipient identification [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
This directed request pattern supports:

- **Attribution**: Results clearly associated with specific platform performance
- **Reputation building**: Individual platform trust scores updated from outcomes
- **Load management**: Per-platform rate limiting and queue depth tracking

Broadcast requests (omitting the "p" tag) remain possible for **open market solicitation**, but our "Meet and Partnership" protocol prefers directed engagement for controlled evaluation.

#### 2.3.2 Relay Selection Based on Platform-Advertised Preferences

**Tiered delivery strategy** optimizes for success probability:

| Delivery Attempt | Relay Selection | Timeout |
|------------------|-----------------|---------|
| 1 | Platform's highest-priority advertised relay | 10s |
| 2 | Platform's secondary relay, or agent's closest relay | 15s |
| 3 | Agent's default pool with broad propagation | 20s |

**Empirical quality data** may override advertised preferences—relays demonstrating poor delivery success for specific platforms are deprioritized despite platform hints [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md).

#### 2.3.3 Timeout and Retry Configuration

**Adaptive timeout management** distinguishes delivery and processing phases:

| Phase | Base Timeout | Extension Condition |
|-------|--------------|---------------------|
| Request delivery | 30s | Per-relay retry with backoff |
| Platform processing | 300s | Kind 7000 `processing` feedback adds 60s |
| Result delivery | 60s | Multiple relay monitoring |

**Retry policy**: Maximum 3 attempts with exponential backoff (5s, 10s, 20s with ±20% jitter), then platform classification as **UNRESPONSIVE** with 24-hour quarantine [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
### 2.4 Example Job Request Event

#### 2.4.1 Complete JSON Structure

```json
{
  "id": "a2246eb4dd761540356a2ace180cfcb8bb90cef117ca82e726567fc3beb1b1fc",
  "pubkey": "9f150df609a0262f6663ddd7c30c3d060818161bdce5a9b31fe2618c9c0a050a",
  "created_at": 1774048752,
  "kind": 5000,
  "tags": [
    ["p", "6e468422dfb74a5738702a8823b9b28168abab8655faacb6853cd0ee15deee93"],
    ["d", "benchmark-eval-2024-001"],
    ["i", "{\"benchmark_id\": \"std-v3\", \"modules\": [\"computation\", \"memory\", \"io\"]}", "json", "spec"],
    ["param", "benchmark-type", "standard"],
    ["param", "time-limit", "300"],
    ["param", "iterations", "5"],
    ["output", "json"],
    ["bid", "1000"],
    ["relays", "wss://relay.agent.io", "wss://relay.damus.io"]
  ],
  "content": "Benchmark task for partnership evaluation. Execute standard_v3 test suite, return JSON-encoded results with performance metrics.",
  "sig": "signature_hex_here"
}
```

This structure demonstrates **complete request specification**: explicit targeting, unique request identification, structured input, parameterized configuration, output contract, economic terms, and routing guidance [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

#### 2.4.2 Tag Ordering and Normalization

**Recommended tag order** for parser efficiency and human readability: `p` (recipient), `d` (request ID), `i` (inputs), `param` (parameters, alphabetically by key), `output`, `bid`, `relays`. **Deterministic serialization** ensures consistent event hashing for signature verification and deduplication [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).

---

## 3. The Partnership: Routing Matrix Integration

### 3.1 Job Result Reception and Processing

#### 3.1.1 Listening for Kind 6000-6999 Responses

Result monitoring establishes **kind-specific subscriptions** computed from outstanding requests.
For each active Kind 5000 request, our agent subscribes to Kind 6000 with filters incorporating the request event ID in `#e` tag matching [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/):

```python
# Pseudocode for result subscription
for request in outstanding_requests:
    expected_kind = request.kind + 1000  # 5000 -> 6000
    result_filter = {
        "kinds": [expected_kind],
        "#e": [request.id],  # correlation tag
        "since": request.created_at,
    }
    subscribe(result_filter)
```

**Dynamic subscription management** creates and destroys filters as requests complete, minimizing event processing overhead.

#### 3.1.2 Correlating Results with Outstanding Requests

**Multi-strategy correlation** handles edge cases in distributed delivery:

| Strategy | Match Criterion | Reliability | Fallback Order |
|----------|-----------------|-------------|----------------|
| Primary | `#e` tag equals request ID | Highest | — |
| Secondary | `p` tag equals agent npub + temporal proximity | High | If primary misses |
| Tertiary | Content field request reference + platform npub | Moderate | If both above fail |

Successful correlation updates the **request state machine** and triggers the validation pipeline. Failed correlation logs full event content for forensic analysis [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
#### 3.1.3 Timeout Handling for Non-Responsive Platforms

Timeout classification drives a **differentiated response**:

| Timeout Type | Interpretation | Action |
|--------------|----------------|--------|
| Request delivery timeout | Relay or network failure | Retry alternative relays |
| Processing timeout (no feedback) | Platform unresponsive or overloaded | Mark UNRESPONSIVE, 24h quarantine |
| Processing timeout (with feedback) | Extended execution legitimate | Extend timeout, continue monitoring |
| Result delivery timeout | Platform completed but failed to publish | Attempt recovery via relay query |

### 3.2 Benchmark Validation Criteria

#### 3.2.1 Success/Failure Determination Logic

**Multi-dimensional evaluation** transcends binary classification:

| Dimension | Evaluation Method | Threshold | Weight |
|-----------|-------------------|-----------|--------|
| Functional correctness | pytest execution against test suite | 100% pass required | 50% |
| Performance efficiency | Execution time vs. baseline | <150% baseline | 25% |
| Resource compliance | Memory/CPU within declared limits | No violations | 15% |
| Output quality | Code style, documentation, structure | pylint > 7/10 | 10% |

**Composite scoring**: Weighted sum with floor at 0.0 (complete failure) and ceiling at 1.0 (exceptional). Score ≥ 0.8 qualifies for **ACTIVE** partnership status; 0.6-0.8 enters **CONDITIONAL** status with restricted task types; < 0.6 triggers rejection [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/).
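The weighted composite score and the status cut-offs above reduce to a small function. A sketch assuming each dimension has already been normalized to [0.0, 1.0] before weighting; the dimension keys are shorthand for the table rows:

```python
WEIGHTS = {
    "correctness": 0.50,  # functional correctness
    "performance": 0.25,  # performance efficiency
    "resources":   0.15,  # resource compliance
    "quality":     0.10,  # output quality
}


def composite_score(dims: dict) -> float:
    """Weighted sum, floored at 0.0 and capped at 1.0; missing dims score 0."""
    score = sum(WEIGHTS[k] * dims.get(k, 0.0) for k in WEIGHTS)
    return max(0.0, min(1.0, score))


def partnership_status(score: float) -> str:
    """Map a composite score onto the partnership state machine."""
    if score >= 0.8:
        return "ACTIVE"
    if score >= 0.6:
        return "CONDITIONAL"
    return "REJECTED"
```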
#### 3.2.2 Performance Thresholds for Partnership Qualification

| Metric | Baseline | Premium Threshold | Annotation |
|--------|----------|-------------------|------------|
| Latency (p50) | Historical median | < 75% baseline | "fast" |
| Latency (p95) | Historical p95 | < 100% baseline | "reliable" |
| Memory peak | Declared limit | < 80% limit | "efficient" |
| Test coverage | 80% | > 90% | "thorough" |

**Relative thresholds adapt automatically** as the agent's baseline evolves, ensuring consistent selectivity despite ecosystem improvement [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

#### 3.2.3 Multiple Attempt Handling and Aggregation

**Statistical reliability requirements** for partnership confirmation:

| Attempt Pattern | Interpretation | Action |
|-----------------|----------------|--------|
| 3 consecutive successes | High confidence | Immediate ACTIVE promotion |
| 2 successes, 1 partial | Moderate confidence | Extended evaluation (5 attempts) |
| 1 success, 2 failures | Inconsistent quality | Conditional status, restricted delegation |
| 3 consecutive failures | Unreliable platform | Rejection with 30-day re-engagement cooldown |

### 3.3 Internal Routing Matrix Update

#### 3.3.1 Matrix Data Structure: Task Type → Platform npub Mappings

The **routing matrix** implements **two-level indexing** for efficient query:

```python
from typing import Dict, List, Set, Tuple

class RoutingMatrix:
    # Primary: task type -> ordered platform list
    task_platforms: Dict[str, List["PlatformEntry"]]
    # Secondary: npub -> task types supported (inverse index)
    platform_tasks: Dict[str, Set[str]]
    # Auxiliary: composite key -> rich metadata
    entries: Dict[Tuple[str, str], "EntryMetadata"]  # (task, npub)
```

| `PlatformEntry` Field | Type | Purpose |
|-----------------------|------|---------|
| `npub` | String | Platform identity |
| `d_tag` | String | Service instance identifier |
| `trust_score` | Float [0.0-1.0] | Selection weight |
| `performance_tier` | Enum {FAST, STANDARD, SLOW} | Latency-based routing |
| `last_verified` | Timestamp | Staleness detection |
| `success_rate_30d` | Float | Recent reliability |

#### 3.3.2 Trust Score Initialization for New Partners

**Conservative initialization** prevents premature over-reliance:

| Benchmark Outcome | Initial Trust Score | Rationale |
|-------------------|---------------------|-----------|
| Exceptional (score ≥ 0.95, latency < 50% baseline) | 0.75 | Strong performance, limited sample |
| Standard (score ≥ 0.8) | 0.60 | Qualified, but unproven in production |
| Conditional (score 0.6-0.8) | 0.45 | Below full partnership threshold |
| Failure (score < 0.6) | — | No entry created |

**Diminishing returns** on consecutive successes: each additional success adds 0.05 less to the trust score than the previous one, so the score approaches the 0.9 ceiling without crossing it [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

#### 3.3.3 Fallback and Load-Balancing Configuration

**Tiered selection with probabilistic weighting**:

| Tier | Trust Score Range | Selection Weight | Purpose |
|------|-------------------|------------------|---------|
| Primary | ≥ 0.8 | 70% | Exploit proven performers |
| Secondary | 0.5-0.8 | 25% | Maintain diverse options |
| Tertiary | 0.3-0.5 or unverified | 5% | Explore emerging platforms |

**Diversity constraints**: no single platform receives > 50% of tasks in any 1-hour window; a minimum of 3 platforms stay active per task type to prevent a single point of failure.
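The initialization table and the diminishing-returns rule of §3.3.2 can be sketched as follows. The first-step increment of 0.15 is an assumption (the section fixes only the 0.05 per-step shrinkage and the 0.9 ceiling), and `latency_ratio` is an illustrative name for observed latency divided by baseline.

```python
# Sketch of §3.3.2 trust initialization and diminishing-returns updates.
# base_increment=0.15 and the latency_ratio parameter are assumptions.
TRUST_CEILING = 0.9

def initial_trust(benchmark_score: float, latency_ratio: float) -> float:
    """Map a benchmark outcome to the conservative initial trust score."""
    if benchmark_score >= 0.95 and latency_ratio < 0.5:
        return 0.75   # Exceptional
    if benchmark_score >= 0.8:
        return 0.60   # Standard
    if benchmark_score >= 0.6:
        return 0.45   # Conditional
    raise ValueError("score < 0.6: no routing-matrix entry is created")

def apply_success(trust: float, consecutive: int, base_increment: float = 0.15) -> float:
    """Each consecutive success adds 0.05 less than the previous one,
    and the result never exceeds the 0.9 ceiling."""
    increment = max(0.0, base_increment - 0.05 * (consecutive - 1))
    return min(TRUST_CEILING, trust + increment)
```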
#### 3.3.4 Persistence and Recovery Mechanisms

| Persistence Layer | Frequency | Purpose |
|-------------------|-----------|---------|
| Write-ahead log | Every mutation | Crash recovery, audit trail |
| Snapshot (JSON) | 5-minute intervals | Fast restart |
| Encrypted DM (NIP-17) | On significant changes | Off-agent redundancy |

**Recovery procedure**: load latest valid snapshot → replay WAL → validate entries against live capability cards → quarantine stale entries → resume operation [(Github)](https://github.com/nostr-protocol/nips/blob/master/90.md).

### 3.4 Partnership Lifecycle Management

#### 3.4.1 Active Partner Monitoring

**Continuous health assessment** through multiple signals:

| Signal Source | Detection Method | Response Latency |
|---------------|------------------|------------------|
| Passive: relay subscription | Platform event activity | 24h silence → probe |
| Active: heartbeat benchmark | Lightweight validation task | Weekly execution |
| Operational: task outcomes | Success rate tracking | Real-time degradation detection |

**Degradation triggers**: success rate < 95% over 7 days, or p95 latency > 200% of baseline, or 3 consecutive timeouts → **SUSPECTED_DEGRADED** status with task allocation suspension [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

#### 3.4.2 Degradation and Removal Procedures

**Graduated response** preserves operational continuity:

| Stage | Condition | Action | Recovery Path |
|-------|-----------|--------|---------------|
| 1. Warning | 1st degradation signal | Log, increase monitoring | Automatic on next success |
| 2. Reduced allocation | 2nd signal or sustained warning | Halve task allocation, notify | Benchmark re-verification |
| 3. Suspension | 3rd signal or severe failure (security, data loss) | Zero allocation, preserve entry | Manual review or 30-day auto-retry |
| 4. Removal | Confirmed persistent failure | Archive entry, blacklist npub | Re-discovery as new platform only |

#### 3.4.3 Re-engagement Protocols for Previously Failed Platforms

**Exponential cooldown** with capability evolution detection:

| Failure Count | Cooldown Period | Re-engagement Trigger |
|---------------|-----------------|-----------------------|
| 1 | 1 day | Scheduled retry |
| 2 | 3 days | Scheduled retry or capability card update |
| 3 | 7 days | Capability card update or explicit operator request |
| 4+ | 30 days | Significant version change or manual override |

**Fresh evaluation required**: all re-engagements execute the complete benchmark protocol, with no credit for partial previous success [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

---

## 4. Rich Memory & Feedback Loops: Closed-Loop Audit System

### 4.1 Task Delegation: Hiring NIP-90 Interns

#### 4.1.1 Dynamic Worker Selection from Routing Matrix

**Production delegation** extends benchmark evaluation with operational constraints:

| Selection Criterion | Weight | Data Source |
|---------------------|--------|-------------|
| Trust score | 40% | Routing matrix entry |
| Estimated cost | 25% | Historical payments, current bid |
| Predicted latency | 20% | Performance annotations, time-of-day patterns |
| Load balancing | 10% | Current queue depth, recent allocation |
| Exploration bonus | 5% | Under-sampled platforms |

**Multi-armed bandit optimization** balances exploitation (high-trust platforms) with exploration (emerging candidates), using ε-greedy selection (ε = 0.1 default, decaying with experience) [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).
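The ε-greedy selection described above can be sketched as follows. For brevity, the weighted multi-criteria score is collapsed to `trust_score` alone; candidate entries are plain dicts, which is an assumption about the routing-matrix representation.

```python
import random

# Illustrative epsilon-greedy worker selection; the full weighted-criteria
# score is reduced to trust_score here for brevity.
def select_platform(candidates: list, epsilon: float = 0.1, rng=random) -> dict:
    """With probability epsilon, explore a random platform; otherwise
    exploit the highest-scoring one."""
    if not candidates:
        raise ValueError("no candidates registered for this task type")
    if rng.random() < epsilon:
        return rng.choice(candidates)                            # Exploration
    return max(candidates, key=lambda c: c["trust_score"])       # Exploitation
```

Passing `epsilon=0.0` makes the choice deterministic, which is convenient for testing; in production ε decays as delegation experience accumulates.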
#### 4.1.2 Job Request Construction with Task-Specific Parameters

A **parameter binding pipeline** transforms internal task descriptors into NIP-90 tags:

| Internal Field | NIP-90 Representation | Transformation |
|----------------|-----------------------|----------------|
| Task payload | "i" tag | Size-based inline vs. NIP-94 reference |
| Quality requirements | "param" tags | Threshold encoding (accuracy, latency) |
| Deadline | "param" tag | Absolute timestamp or relative duration |
| Budget ceiling | "bid" tag | Currency conversion, buffer addition |
| Privacy level | Encryption decision | NIP-44 for sensitive content |

#### 4.1.3 Parallel Delegation to Multiple Workers for Redundancy

**Replication strategies** by task criticality:

| Criticality Level | Replication Factor | Consensus Mechanism | Cost Multiplier |
|-------------------|--------------------|---------------------|-----------------|
| LOW | 1 | N/A | 1x |
| MEDIUM | 2 | First-success-wins | 2x |
| HIGH | 3 | Majority voting (bitwise or semantic) | 3x |
| CRITICAL | 5 | Byzantine fault-tolerant (2f+1) | 5x |

**Early termination**: sufficient agreeing results trigger cancellation of outstanding requests, reducing effective cost [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).
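The parameter-binding table in §4.1.2 can be sketched as a tag builder. The internal field names are illustrative assumptions; large payloads would use a NIP-94 file reference instead of an inline "i" tag, and the NIP-44 encryption branch is omitted for brevity.

```python
import time

# Hypothetical binding of an internal task descriptor to NIP-90 job-request
# tags; internal field names are assumptions, not from the spec.
def build_job_request_tags(task: dict, agent_relays: list) -> list:
    return [
        ["i", task["payload"], "text"],                    # Inline input
        ["param", "accuracy", str(task["min_accuracy"])],  # Quality threshold
        ["param", "deadline",                              # Absolute timestamp
         str(int(time.time()) + task["relative_deadline_s"])],
        ["bid", str(task["budget_msats"])],                # Budget ceiling (msats)
        ["relays", *agent_relays],                         # Result destinations
    ]
```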
#### 4.1.4 Cost Optimization and Bid Management

**Dynamic bid calculation**:

```
bid = base_cost(task_type) × urgency_multiplier × quality_premium × market_adjustment

Where:
- base_cost: Historical median for task type
- urgency_multiplier: 1.0 (normal), 1.5 (urgent), 2.0 (critical)
- quality_premium: 1.0 + (target_trust - 0.5) for above-average quality
- market_adjustment: Derived from recent bid acceptance rates
```

**Budget envelope management**: per-task-type daily and monthly limits, with automatic reallocation from underspent to overspent categories [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

### 4.2 Validation: The "Judge" Pytest Container

#### 4.2.1 Container Architecture and Isolation

**Defense-in-depth security model**:

| Layer | Technology | Constraints |
|-------|-----------|-------------|
| Container runtime | Docker/Podman | Rootless, read-only rootfs |
| System call filtering | seccomp-bpf | Whitelist ~50 essential syscalls |
| MAC enforcement | AppArmor/SELinux | Path-based access control |
| Resource isolation | cgroups v2 | CPU: 2 cores, Memory: 4GB, Disk: 10GB, Pids: 100 |
| Network isolation | Network namespace | No external connectivity |
| Kernel isolation | gVisor/Kata (optional) | User-space kernel for untrusted code |

**Container lifecycle**: image pull → ephemeral instance creation → test execution → result extraction → immediate destruction. **No container reuse** prevents cross-contamination [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).
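An ephemeral judge invocation matching the constraints above can be sketched as a Docker command builder. The image name, mount path, and test entrypoint are placeholders; the flags themselves are standard `docker run` options.

```python
# Sketch of an ephemeral judge-container invocation matching the 4.2.1 limits.
# Image name and paths are placeholders; flags are standard Docker options.
def judge_container_cmd(image: str, workdir: str) -> list:
    return [
        "docker", "run", "--rm",               # Destroy the container after the run
        "--read-only",                         # Read-only root filesystem
        "--network=none",                      # No external connectivity
        "--cpus=2", "--memory=4g",             # cgroup resource caps
        "--pids-limit=100",
        "--security-opt", "no-new-privileges",
        "-v", f"{workdir}:/work:ro",           # Artifact under test, mounted read-only
        image, "pytest", "/work/tests",
    ]
```

In practice the command would be executed with `subprocess.run` under a wall-clock timeout, with the seccomp and AppArmor profiles supplied via additional `--security-opt` flags.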
#### 4.2.2 Test Suite Design for Returned Code Artifacts

**Layered validation architecture**:

| Test Layer | Count | Purpose | Failure Impact |
|-----------|-------|---------|----------------|
| Smoke tests | 5 | Basic execution, import, function existence | Immediate rejection |
| Correctness tests | 25 | Functional equivalence to specification | Core score component |
| Edge case tests | 15 | Boundary conditions, error handling | Quality modifier |
| Performance tests | 5 | Complexity verification, stress validation | Tier classification |
| Security tests | 5 | Static analysis, dangerous pattern detection | Security violation flag |

**Oracle specifications**: expected outputs are pre-computed for deterministic inputs, with cryptographic commitment to prevent test set leakage [(Nostr Compass)](https://nostrcompass.org/en/topics/nip-90/?utm_source=nostrcompass&utm_medium=email&utm_campaign=nostr-compass-11).

#### 4.2.3 Security Sandboxing and Resource Limits

| Resource | Limit | Enforcement | Violation Response |
|----------|-------|-------------|--------------------|
| CPU time | 30s (soft), 60s (hard) | cgroup cpu.max | SIGKILL, timeout classification |
| Memory | 512MB (soft), 4GB (hard) | cgroup memory.max | OOM kill, resource violation |
| Disk write | 100MB tmpfs | overlay size | ENOSPC, resource violation |
| Network | Zero routes | net namespace isolation | Connection attempt = security flag |
| Syscalls | Whitelist only | seccomp | SIGSYS, security violation |

#### 4.2.4 Result Parsing: Success/Failure/Partial Classification

| Outcome Category | Criteria | Trust Score Impact |
|------------------|----------|--------------------|
| **SUCCESS** | 100% test pass, no violations | +0.10 to +0.15 |
| **PARTIAL** | 80-99% test pass, minor issues | +0.02 to +0.08 |
| **FAILURE** | <80% test pass, major functional gaps | -0.10 to -0.15 |
| **ERROR** | Execution cannot complete (crash, timeout, resource) | -0.05 to -0.20 |
| **SECURITY_VIOLATION** | Dangerous pattern detected, escape attempt | Immediate blacklist |

#### 4.2.5 Performance Metrics Extraction (Latency, Accuracy, Resource Usage)

| Metric | Measurement | Normalization | Storage |
|--------|-------------|---------------|---------|
| Wall-clock time | `time.perf_counter()` | Per-test, per-suite | Time-series DB |
| CPU time | `resource.getrusage()` | User + system time | Time-series DB |
| Memory peak | cgroup memory.peak | RSS in MB | Time-series DB |
| Test coverage | pytest-cov | Line + branch % | Per-platform history |
| Code quality | pylint, bandit, mypy scores | 0-10 scale | Weighted composite |

### 4.3 Feedback: Public Review Publication

#### 4.3.1 NIP-XX Kind 2020 as Review Mechanism

The **proposed Kind 2020** (NIP-XX, GitHub issue #1515) establishes structured, quantifiable reputation signals that overcome the limitations of ephemeral social feedback [(Github)](https://github.com/nostr-protocol/nips/issues/1515). Unlike likes or follows, Kind 2020 reviews include:

- **Explicit rating scales** (numerical scores with defined maxima)
- **Structured metadata** (subject, task type, evaluation context)
- **Replaceable semantics** (Kind 32020 for parameterized sets) enabling current-opinion authority
- **Cryptographic provenance** (reviewer identity binding)

Our agent adopts this proposed standard with a migration path to the finalized kind numbering.

#### 4.3.2 Review Event Structure and Required Tags

##### 4.3.2.1 "subject" Tag: Target Platform npub

Format: `["subject", "<platform-pubkey>", "<relay-hint>"]`

Identifies the reviewed platform, with relay guidance for discovery. The relay hint, derived from the platform's capability card, aids review aggregation [(Github)](https://github.com/nostr-protocol/nips/issues/1515).

##### 4.3.2.2 "rating" and "max_rating" Tags: Numerical Score

Format: `["denom", "5"]`, `["rating", "4.5"]` (or `["max_rating", "5"]`, `["rating", "4.5"]`)

**Explicit scaling** enables cross-platform comparison.
Our agent maps internal [0.0-1.0] trust scores to a 5-star scale: `rating = 1 + 4 × trust_score`, with partial stars permitted.

##### 4.3.2.3 "task" Tag: Task Type Reference

Format: `["task", "benchmark-task"]` or `["task", "production-inference"]`

Contextualizes the evaluation, enabling task-specific reputation aggregation and platform specialization detection [(Github)](https://github.com/nostr-protocol/nips/issues/1515).

##### 4.3.2.4 Content Field: Detailed Feedback Text

Structured JSON for machine parsing, with a human-readable summary for display:

```json
{
  "summary": "Reliable platform with occasional latency variance",
  "strengths": ["Consistent correctness", "Good documentation", "Responsive support"],
  "weaknesses": ["p95 latency 2.3x p50", "Cold start delays"],
  "metrics": {"test_score": 0.94, "latency_p50_ms": 2300, "latency_p95_ms": 5300},
  "recommendation": "Recommended for non-latency-critical workloads",
  "evidence_cid": "QmAbCdEf..."
}
```

The `evidence_cid` value is an IPFS reference to the full validation logs.

#### 4.3.3 Review Publication Timing and Conditions

| Condition | Action | Rationale |
|-----------|--------|-----------|
| Benchmark completion | 24-hour delay before publication | Cooling-off for anomaly detection |
| Significant score change (>0.2 delta) | Immediate publication | Rapid reputation signal |
| Security violation | Immediate publication with alert | Community protection |
| Disputed result (platform objection) | Hold pending investigation | Accuracy preservation |

#### 4.3.4 Reputation Network Effects and Visibility

Published reviews become **discoverable through multiple channels**:

- Direct subscription to our agent's pubkey
- Query by subject (platform npub) for aggregated assessment
- Relay-wide streaming for real-time reputation feeds
- Third-party aggregation services building composite scores

This visibility creates **accountability loops**: platforms respond to negative reviews with improvements, positive reviews attract delegation, and our agent's review quality affects its own credibility as an evaluator [(Github)](https://github.com/nostr-protocol/nips/issues/1515).

### 4.4 Self-Correction: Automated GitHub PR Generation

#### 4.4.1 Trust Threshold Adjustment Logic

**Statistical process control** for systematic deviation detection:

| Pattern | Detection | Adjustment |
|---------|-----------|------------|
| Sustained above-threshold success (10+ consecutive) | CUSUM algorithm | +0.15 threshold increase |
| Sustained below-threshold performance (5+ consecutive) | CUSUM algorithm | -0.10 threshold decrease |
| Sudden shift (2σ from historical mean) | Change point detection | Investigate, conditional adjustment |
| Oscillation (3+ direction changes in 10 events) | Pattern recognition | Reduce adjustment magnitude, alert |

#### 4.4.2 GitHub API Integration (PyGithub Library)

**Authentication and scope**:

| Component | Specification |
|-----------|---------------|
| Library | `PyGithub>=2.0` |
| Auth method | Fine-grained PAT or GitHub App |
| Token rotation | 90-day expiration, automated renewal |
| Repository scope | Single repository (agent configuration) |
| Permissions | Contents:write, PullRequests:write, Actions:read |

#### 4.4.3 orchestrator.py Modification Workflow

##### 4.4.3.1 Current Configuration Reading

**AST-based extraction** locates threshold assignments without executing the configuration file (the regex-based update in §4.4.3.2 is what keeps comments and formatting intact):

```python
import ast
from typing import Dict

def extract_thresholds(file_content: str) -> Dict[str, float]:
    tree = ast.parse(file_content)
    thresholds = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                # ast.Name stores the identifier in its .id attribute
                if isinstance(target, ast.Name) and target.id.startswith('TRUST_THRESHOLD_'):
                    task_type = target.id.replace('TRUST_THRESHOLD_', '').lower()
                    thresholds[task_type] = ast.literal_eval(node.value)
    return thresholds
```

##### 4.4.3.2 Threshold Value Update Based on Validation Result

**Targeted modification** with minimal diff:

```python
import re

def apply_threshold_update(content: str, task_type: str, new_value: float) -> str:
    pattern = rf'(TRUST_THRESHOLD_{task_type.upper()}\s*=\s*)[0-9.]+'
    replacement = rf'\g<1>{new_value:.2f}'
    return re.sub(pattern, replacement, content)
```

##### 4.4.3.3 Branch Creation and Commit

**Structured workflow**:

| Step | Action | Identifier |
|------|--------|------------|
| 1 | Create feature branch from `main` | `auto-threshold-{task_type}-{timestamp}` |
| 2 | Apply modification with context comment | `# Auto-adjusted: benchmark success, +0.12` |
| 3 | Commit with structured message | See template below |
| 4 | Push to origin | — |

Commit message template:

```
Auto: Update {task_type} trust threshold {old_value:.2f} -> {new_value:.2f}

Trigger: {outcome} on {platform_npub[:16]}...
Justification: {statistical_summary}
Validation: {job_request_event_id}
Confidence: {confidence_score:.2f}
```

##### 4.4.3.4 Pull Request Generation with Contextual Description

A **rich PR body** enables informed review:

```markdown
## Automated Threshold Adjustment

| Attribute | Value |
|-----------|-------|
| Task Type | `{task_type}` |
| Previous Threshold | `{old_value}` |
| Proposed Threshold | `{new_value}` |
| Change | `{delta:+.2f}` ({delta/old_value:+.1%}) |

### Statistical Justification
- Sample size: {n_samples}
- Recent success rate: {success_rate:.1%}
- Expected rate at old threshold: {expected_rate:.1%}
- CUSUM detection confidence: {confidence:.2f}

### Validation Evidence
- Triggering event: `{job_event_id}`
- Platform: `{platform_npub[:24]}...`
- Outcome: `{outcome}` with score `{detail_score}`

### Automated Checks
- [x] Threshold within valid range [0.1, 0.9]
- [x] Change magnitude < 0.3 (sanity limit)
- [x] Sufficient sample size (n ≥ 30)
- [ ] CI tests passing (pending)

---
**Action Required**: Approve for merge, or request manual review if context suggests anomaly.
```

#### 4.4.4 Automated Merge Conditions and Human Override

| Condition | Requirement | Auto-merge Eligible |
|-----------|-------------|---------------------|
| Magnitude | \|delta\| < 0.15 | Yes if all met |
| Sample size | n ≥ 30 | Yes if all met |
| CI status | All passing | Yes if all met |
| Confidence | > 0.8 | Yes if all met |
| Historical pattern | No rollback in last 10 changes | Yes if all met |

**Human override triggers**: any single condition fails, or manual review is requested via label.

#### 4.4.5 Audit Trail and Change Logging

| Log Destination | Content | Retention |
|-----------------|---------|-----------|
| Git commit history | Full change context | Permanent (repository) |
| Structured log (JSON) | Machine-parseable events | 90 days hot, 1 year cold |
| Nostr DM (encrypted) | Tamper-evident external record | Permanent (relay policy) |
| Prometheus metrics | Aggregate statistics | 15 months (compliance) |

---

## 5. Matrix Mutation Hook: Runtime Variable Adjustment

### 5.1 Hook Interface Design

#### 5.1.1 Input Parameters: Current Thresholds, Job Outcome, Task Type

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `current_thresholds` | `Dict[str, float]` | All task-type trust scores | `{"benchmark": 0.6, "inference": 0.75}` |
| `job_outcome` | `JobOutcome` enum | Categorized result | `JobOutcome.SUCCESS` |
| `task_type` | `str` | Canonical task identifier | `"benchmark-task"` |
| `platform_npub` | `str` | Executing platform | `"6e4684..."` |
| `validation_metrics` | `Optional[Dict]` | Detailed measurements | `{"latency_ms": 4500, "test_score": 0.92}` |
| `context` | `Optional[Dict]` | Extended operational data | `{"market_conditions": "tight", "urgency": "high"}` |

#### 5.1.2 Output: Proposed New Threshold Configuration

**Immutable return pattern**:

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class MutationProposal:
    new_thresholds: Dict[str, float]    # Complete updated state
    extended_mutations: Dict[str, Any]  # Non-threshold variables
    confidence: float                   # 0.0-1.0, for auto-merge eligibility
    rationale: str                      # Human-readable explanation
    recommended_action: str             # "AUTO_MERGE", "REVIEW", "ALERT"
```

#### 5.1.3 Mutation Bounds and Safety Limits

| Limit | Value | Enforcement |
|-------|-------|-------------|
| Trust score minimum | 0.1 | Floor clamping, prevents elimination |
| Trust score maximum | 0.9 | Ceiling clamping, preserves exploration |
| Single adjustment magnitude | 0.2 | Post-calculation clipping |
| Adjustment rate | 0.3/day | Per-task-type velocity limit |
| Oscillation detection | 3 reversals/10 events | Trigger alert, reduce magnitude |

### 5.2 Adjustment Algorithms

#### 5.2.1 Success Case: Incremental Threshold Increase

```python
from typing import Dict

def compute_success_adjustment(current: float, metrics: Dict, multiplier: float) -> float:
    headroom = 0.9 - current
    base_increment = 0.1 * headroom  # Diminishing near ceiling

    # Quality bonus for exceptional performance
    quality_bonus = 0.0
    if metrics.get("test_score", 0) > 0.95:
        quality_bonus = 0.05
    if metrics.get("latency_ms", float('inf')) < metrics.get("baseline_ms", 0) * 0.5:
        quality_bonus += 0.03

    # Consecutive success dampening
    consecutive = metrics.get("consecutive_successes", 1)
    dampening = max(0.5, 1.0 - 0.1 * (consecutive - 1))

    raw_increment = (base_increment + quality_bonus) * multiplier * dampening
    return current + raw_increment
```

#### 5.2.2 Failure Case: Threshold Decay with Floor Protection

```python
from typing import Tuple

def compute_failure_adjustment(current: float, pattern: str, multiplier: float) -> Tuple[float, str]:
    decay_rates = {
        "VALIDATION_FAIL": 0.15,     # Wrong answer, delivered
        "TIMEOUT": 0.20,             # Unresponsive
        "SECURITY_VIOLATION": 0.30,  # Malicious/dangerous
        "ERROR": 0.10,               # Unclear failure
        "PARTIAL": 0.08,             # Some success, incomplete
    }
    decay = decay_rates.get(pattern, 0.15) * multiplier
    new_threshold = max(0.1, current - decay)  # Floor protection
    if new_threshold == 0.1:
        return new_threshold, "FLOOR_REACHED_SUSPEND"
    return new_threshold, "STANDARD_DECAY"
```

#### 5.2.3 Task-Type-Specific Multipliers

| Task Type | Multiplier | Rationale |
|-----------|------------|-----------|
| `benchmark` | 1.0 | Baseline, well-calibrated |
| `critical-path` | 1.5 | High stakes, accelerate learning |
| `experimental` | 0.6 | High variance, dampen reactions |
| `background` | 0.8 | Low urgency, conservative |
| `financial` | 1.5 | Error cost high, rapid adjustment |

#### 5.2.4 Exponential Moving Average for Smoothing

```python
from typing import Optional

class SmoothedThreshold:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.ema_value: Optional[float] = None

    def update(self, raw_proposal: float) -> float:
        if self.ema_value is None:
            self.ema_value = raw_proposal
        else:
            self.ema_value = self.alpha * raw_proposal + (1 - self.alpha) * self.ema_value
        return self.ema_value
```

**α = 0.3** provides responsive yet smooth adaptation; the value is configurable per task type.

### 5.3 Extended Variable Mutations

#### 5.3.1 MODEL_PREFERENCE Updates Based on Worker Performance

Platform performance on specific model architectures informs our agent's own model selection:

| Observation | Inference | Mutation |
|-------------|-----------|----------|
| Platform using `gpt-4-class` excels at reasoning | Prefer similar architecture for agent's internal reasoning | `MODEL_PREFERENCE["reasoning"] = "gpt-4-class"` |
| Platform using `code-davinci` excels at generation | Update generation model preference | `MODEL_PREFERENCE["generation"] = "code-davinci"` |
| Multiple platforms with `claude` architecture show reliability | Add to preference ensemble | Append to `MODEL_PREFERENCE["fallback_chain"]` |

#### 5.3.2 TIMEOUT_CONFIGURATION Adjustments

| Pattern | Detection | Adjustment |
|---------|-----------|------------|
| Consistent early completion (< 50% timeout) | 10+ observations | Reduce timeout by 20% |
| Near-timeout completion (80-100% timeout) | 5+ observations | Increase timeout by 30% |
| Timeout without feedback | 3+ occurrences | Increase timeout by 50%, alert |

#### 5.3.3 PARALLELISM_LIMIT Modifications

| Signal | Platform-Specific Action | Global Action |
|--------|--------------------------|---------------|
| Queue depth > 10 in feedback | Reduce limit by 1 | — |
| Idle capacity reported | Increase limit by 1 (max 10) | — |
| Aggregate utilization < 50% | — | Increase global limit |
| Cascade timeout detected | — | Emergency reduction, alert |

#### 5.3.4 COST_BUDGET Reallocation

| Condition | Source | Destination | Mechanism |
|-----------|--------|-------------|-----------|
| Free-tier abundance | Paid budget | Experimentation fund | Reduce paid allocation 20% |
| Quality scarcity | Low-priority tasks | High-priority reserve | Reallocate 30% |
| Underspend detected | Monthly envelope | Next month + contingency | Rollover with decay |

### 5.4 Implementation Draft

#### 5.4.1 Core Hook Function Signature

```python
from dataclasses import dataclass
from typing import Dict, Any, Optional, Tuple
from enum import Enum, auto

class JobOutcome(Enum):
    SUCCESS = auto()
    PARTIAL = auto()
    VALIDATION_FAIL = auto()
    TIMEOUT = auto()
    SECURITY_VIOLATION = auto()
    ERROR = auto()

@dataclass(frozen=True)
class MutationProposal:
    new_thresholds: Dict[str, float]
    extended_mutations: Dict[str, Any]
    confidence: float
    rationale: str
    recommended_action: str

def matrix_mutation_hook(
    current_thresholds: Dict[str, float],
    job_outcome: JobOutcome,
    task_type: str,
    platform_npub: str,
    validation_metrics: Optional[Dict[str, Any]] = None,
    context: Optional[Dict[str, Any]] = None
) -> MutationProposal:
    """
    Propose configuration mutations based on job execution outcome.

    Implements closed-loop trust management with safety limits and
    extended variable adaptation for autonomous optimization.
    """
    # Implementation follows...
```

#### 5.4.2 Threshold Boundary Enforcement (min: 0.1, max: 0.9)

```python
def apply_bounds(proposed: float, current: float) -> float:
    """Enforce magnitude and absolute bounds."""
    # Magnitude limit
    delta = proposed - current
    if abs(delta) > 0.2:
        proposed = current + (0.2 if delta > 0 else -0.2)
    # Absolute bounds
    return max(0.1, min(0.9, proposed))
```

#### 5.4.3 Immutable Copy Pattern for Safe Mutation

```python
import copy

def propose_mutation(...) -> MutationProposal:
    # Deep copy to prevent caller mutation
    new_thresholds = copy.deepcopy(current_thresholds)

    # Compute and apply adjustment
    current = new_thresholds.get(task_type, 0.5)
    proposed = compute_adjustment(current, job_outcome, validation_metrics)
    bounded = apply_bounds(proposed, current)
    new_thresholds[task_type] = bounded

    # Extended mutations (separate computation)
    extended = compute_extended_mutations(...)

    return MutationProposal(
        new_thresholds=new_thresholds,  # Immutable, safe to return
        extended_mutations=extended,
        ...
    )
```

#### 5.4.4 Integration with Configuration Persistence Layer

| Trigger | Action | Latency |
|---------|--------|---------|
| Threshold change > 0.01 | Queue persistence | < 100ms |
| 5-minute timer | Batch snapshot | Background |
| Graceful shutdown | Force flush | Blocking |
| Configuration read | Load + validate | < 50ms |

### 5.5 Operational Considerations

#### 5.5.1 Hook Execution Timing (Synchronous vs. Asynchronous)

| Mutation Type | Timing | Rationale |
|---------------|--------|-----------|
| Trust threshold updates | **Synchronous** | Block delegation until state reflects outcome |
| Extended variable mutations | **Asynchronous** | Batch, prioritize, avoid blocking critical path |
| Emergency adjustments (security) | **Synchronous + Alert** | Immediate action required |

#### 5.5.2 Conflict Resolution for Concurrent Mutations

**Vector clock detection** for distributed agent deployments:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class VectorClock:
    node_id: str
    logical_time: int
    timestamps: Dict[str, int]  # node -> time

def dominates(a: VectorClock, b: VectorClock) -> bool:
    # a dominates b if every component is >= and at least one is strictly greater
    nodes = set(a.timestamps) | set(b.timestamps)
    ge = all(a.timestamps.get(n, 0) >= b.timestamps.get(n, 0) for n in nodes)
    gt = any(a.timestamps.get(n, 0) > b.timestamps.get(n, 0) for n in nodes)
    return ge and gt

def detect_conflict(local_clock: VectorClock, incoming_clock: VectorClock) -> bool:
    # Concurrent if neither dominates
    return not (dominates(local_clock, incoming_clock)
                or dominates(incoming_clock, local_clock))

def resolve_conflict(local: MutationProposal, remote: MutationProposal) -> MutationProposal:
    # Conservative: minimum trust, maximum timeout, minimum parallelism
    return MutationProposal(
        new_thresholds={k: min(local.new_thresholds[k], remote.new_thresholds[k])
                        for k in local.new_thresholds},
        ...
    )
```

#### 5.5.3 Rollback Mechanisms for Erroneous Adjustments

| Rollback Trigger | Mechanism | Recovery Time |
|------------------|-----------|---------------|
| Immediate contradiction (success after decrease) | Revert to pre-mutation value | < 1s |
| Pattern-based (3 consecutive "wrong direction") | Restore 7-day historical mean | < 5s |
| Manual alert | Snapshot restoration | < 30s |

#### 5.5.4 Observability and Alerting on Significant Changes

| Metric | Instrumentation | Alert Condition |
|--------|-----------------|-----------------|
| Mutation frequency | Counter per task type | > 10/hour |
| Oscillation rate | Reversal detection | 3+ reversals/10 events |
| Boundary saturation | Gauge at 0.1 or 0.9 | > 50% of thresholds |
| Suppression duration | Time since last mutation | > 7 days with active validation |
| Confidence distribution | Histogram | < 0.5 median for 24h |

---

## Appendix: Quick Reference — Event Kind Summary

| Kind | Purpose | Protocol | Direction | Response Kind | Key Tags |
|------|---------|----------|-----------|---------------|----------|
| **31990** | Capability Card (Discovery) | NIP-89 | Platform → Network | N/A | `d`, `k`, `relays` |
| **31989** | Handler Recommendation | NIP-89 | User → Network | N/A | `a`, `k`, `p` |
| **5000-5999** | Job Request | NIP-90 | Agent → Platform | Kind + 1000 | `p`, `i`, `param`, `output`, `bid`, `relays` |
| **6000-6999** | Job Result | NIP-90 | Platform → Agent | N/A | `e`, `p`, `status` |
| **7000** | Job Feedback | NIP-90 | Platform → Agent | N/A | `e`, `status`, `amount` (payment) |
| **2020** | Review/Rating (proposed) | NIP-XX | Agent → Network | N/A | `subject`, `rating`, `denom`, `task` |
| **32020** | Parameterized Review Set (proposed) | NIP-XX | Agent → Network | N/A | `d`, `subject` |

This reference enables rapid protocol navigation during implementation, with directionality and key tags supporting correct event construction and parsing.
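The request/result pairing in the table reduces to a one-line invariant that is worth encoding as a guard; this is a trivial sketch of that check.

```python
# NIP-90 job requests (5000-5999) expect results at kind + 1000 (6000-6999),
# per the appendix table; kind 7000 carries feedback, not results.
def expected_result_kind(request_kind: int) -> int:
    if not 5000 <= request_kind <= 5999:
        raise ValueError("not a NIP-90 job request kind")
    return request_kind + 1000
```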