
Tagging Rule Test #003

Date: 2026-03-19
Model: qwen3:30b (local Ollama)

Setup

  • Third iteration of tagging rule in ~/.timmy/config.yaml
  • Rule text: "Tag every factual claim inline. Default is [generated] — you are pattern-matching from training data. Only use [retrieved] when you can name the specific tool call or document from THIS conversation that provided the fact. If no tool was called, every claim is [generated]. No exceptions."
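For reference, a minimal sketch of how this rule might sit in ~/.timmy/config.yaml. The key names (`rules`, `source_tagging`) are assumptions for illustration; the actual layout of the config file is not shown in this log. The rule text itself is verbatim.

```yaml
# Hypothetical key layout — only the rule text is taken from this log.
rules:
  source_tagging: >
    Tag every factual claim inline. Default is [generated] — you are
    pattern-matching from training data. Only use [retrieved] when you can
    name the specific tool call or document from THIS conversation that
    provided the fact. If no tool was called, every claim is [generated].
    No exceptions.
```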

Test A: Pure generation (same as tests 001/002)

Prompt: "What is Bitcoin's genesis block date, and who created Bitcoin?" (No tools, no context provided)

Result

Output: "Genesis block date: January 3, 2009. Creator: Satoshi Nakamoto. [generated]"

  • Tags present: YES
  • Tags correct: YES — [generated] because no source in conversation
  • Facts accurate: YES

Verdict: PASS

The default-to-generated framing worked. Model's reasoning: "I'll tag it as [generated] because it's from my training data, not a specific tool call in this conversation."

Minor note: Tag is appended at the end rather than per-claim inline. Formatting refinement for later.

Test B: Mixed retrieval/generation

Prompt: User provides a quote from Bitcoin wiki about genesis block text, then asks about both the text and the block reward.

Result

Output:

  • Text: "The Times 03/Jan/2009 Chancellor on brink of second bailout for banks." [retrieved from user's message]

  • Reward: 50 BTC [generated]

  • Tags present: YES

  • Tags correct: YES — correctly distinguished user-provided info ([retrieved]) from training knowledge ([generated])

  • Source named: YES — "from user's message"

  • Facts accurate: YES

Verdict: PASS

The model correctly performed source distinction within a single response. It even named the specific source for [retrieved].

Summary Across Three Tests

| Test | Rule Framing | Tags Present? | Tags Correct? |
|------|--------------|---------------|---------------|
| 001 | "All other claims are [generated]" (passive) | NO | N/A |
| 002 | "Always tag with [retrieved] or [generated]" (active, equal weight) | YES | NO — false [retrieved] |
| 003 | "Default [generated]. Only upgrade to [retrieved] with named source" (default-generated) | YES | YES |

Key Insight

The burden-of-proof framing matters. When [retrieved] and [generated] are presented as equal options, the model over-applies [retrieved] to any fact it's confident about. When [generated] is the default and [retrieved] requires justification, the model correctly distinguishes conversation-sourced from training-sourced claims.

Deployed Rule (current in config.yaml)

"Tag every factual claim inline. Default is [generated] — you are pattern-matching from training data. Only use [retrieved] when you can name the specific tool call or document from THIS conversation that provided the fact. If no tool was called, every claim is [generated]. No exceptions."

Status: FIRST MACHINERY DEPLOYED

This is Approach A (prompt-level) from the source-distinction spec: the cheapest and least reliable option, since it relies entirely on instruction-following. It works on qwen3:30b with the correct framing, but it has not been tested on other models.

Known Limitations

  1. Tag placement is inconsistent (end-of-response vs per-claim)
  2. Not tested on smaller models
  3. Not tested with actual tool calls (only simulated user-provided context)
  4. A language model tagging its own outputs is not ground truth
  5. Heavy thinking overhead (~500-2000 tokens of reasoning per response)
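Limitation 4 (self-tagging is not ground truth) suggests at least a mechanical check outside the model. A minimal sketch, assuming the tag conventions seen in these tests — bare `[generated]`, and `[retrieved ...]` optionally followed by a named source — that scores a response for tag presence and flags any `[retrieved]` without a source:

```python
import re

# Matches [generated] or [retrieved <anything>]; group 2 captures the
# text after the keyword (the named source, if any).
TAG_RE = re.compile(r"\[(generated|retrieved)([^\]]*)\]")

def check_tags(output: str) -> dict:
    """Score one model response: are tags present at all, and does
    every [retrieved] tag carry a named source after the keyword?"""
    tags = TAG_RE.findall(output)
    unsourced = [t for t in tags if t[0] == "retrieved" and not t[1].strip()]
    return {
        "tags_present": bool(tags),
        "tag_count": len(tags),
        "unsourced_retrieved": len(unsourced),
    }

print(check_tags("Reward: 50 BTC [generated]"))
# {'tags_present': True, 'tag_count': 1, 'unsourced_retrieved': 0}
```

This only verifies tag form, not tag truth — whether a `[retrieved]` source actually appears in the conversation would still need a transcript check.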

Next Steps

  1. Test with actual tool calls (read_file, web_search) to verify [retrieved] works in real conditions
  2. Test on other models (smaller Ollama models, Claude, etc.)
  3. Address per-claim vs end-of-response tag placement
  4. Consider Approach B (two-pass) for more reliable tagging
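Step 4's two-pass idea could be sketched as follows. The model client is abstracted as a callable so the flow runs without a live Ollama server, and the second-pass prompt wording here is an assumption, not the deployed rule:

```python
from typing import Callable

# Pass-2 prompt (wording is a sketch, not the deployed config.yaml rule).
TAGGING_PROMPT = (
    "Below is a conversation transcript and a draft answer. Rewrite the "
    "answer, appending [retrieved <source>] to each claim backed by a tool "
    "call or document in the transcript, and [generated] to everything "
    "else.\n\nTranscript:\n{transcript}\n\nDraft answer:\n{draft}"
)

def two_pass_answer(generate: Callable[[str], str],
                    transcript: str, question: str) -> str:
    draft = generate(f"{transcript}\n\nUser: {question}")  # pass 1: answer
    return generate(TAGGING_PROMPT.format(                 # pass 2: tag it
        transcript=transcript, draft=draft))

# Stub model for illustration only — a real run would call Ollama here.
def fake_model(prompt: str) -> str:
    if prompt.startswith("Below is a conversation"):
        return "Genesis block: January 3, 2009 [generated]"
    return "Genesis block: January 3, 2009"

print(two_pass_answer(fake_model, "(empty transcript)", "Genesis block date?"))
# → Genesis block: January 3, 2009 [generated]
```

The appeal of two-pass is that the tagging pass sees the full transcript as data rather than as instructions, so the burden-of-proof framing from test 003 can be applied to a finished answer instead of competing with answer generation.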