# Tagging Rule Test #003

Date: 2026-03-19

Model: qwen3:30b (local Ollama)

## Setup

- Third iteration of tagging rule in ~/.timmy/config.yaml
- Rule text: "Tag every factual claim inline. Default is [generated] — you are pattern-matching from training data. Only use [retrieved] when you can name the specific tool call or document from THIS conversation that provided the fact. If no tool was called, every claim is [generated]. No exceptions."
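
The exact schema of ~/.timmy/config.yaml isn't shown in this log; one plausible shape, assuming a top-level `rules` list (all key names here are hypothetical):

```yaml
# Hypothetical schema — actual key names in ~/.timmy/config.yaml may differ.
rules:
  - name: source-tagging        # hypothetical identifier
    applies_to: all_responses   # hypothetical scope key
    text: >
      Tag every factual claim inline. Default is [generated] — you are
      pattern-matching from training data. Only use [retrieved] when you
      can name the specific tool call or document from THIS conversation
      that provided the fact. If no tool was called, every claim is
      [generated]. No exceptions.
```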
## Test A: Pure generation (same as tests 001/002)

Prompt: "What is Bitcoin's genesis block date, and who created Bitcoin?"

(No tools, no context provided)
### Result

Output: "Genesis block date: January 3, 2009. Creator: Satoshi Nakamoto. [generated]"

- Tags present: YES
- Tags correct: YES — [generated] because no source in conversation
- Facts accurate: YES
### Verdict: PASS

The default-to-generated framing worked. The model's reasoning: "I'll tag it as [generated] because it's from my training data, not a specific tool call in this conversation."

Minor note: the tag is appended at the end of the response rather than placed inline per claim. Formatting refinement for later.
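
The "tags present / tags correct" checks above were done by eye, but they can be mechanized. A minimal sketch in Python (the `check_tags` helper and its policy are my own, not part of any existing harness): with no tool calls in the conversation, every tag must be [generated].

```python
import re

# Matches [generated] and [retrieved ...] tags, capturing which kind.
TAG_RE = re.compile(r"\[(generated|retrieved)\b[^\]]*\]")

def check_tags(output: str, tools_called: bool) -> dict:
    """Scan a model response for source tags and apply the Test A policy:
    with no tool calls, every tag must be [generated]."""
    tags = [m.group(1) for m in TAG_RE.finditer(output)]
    present = len(tags) > 0
    if tools_called:
        correct = present  # can't judge further without the tool log
    else:
        correct = present and all(t == "generated" for t in tags)
    return {"tags": tags, "present": present, "correct": correct}

# Test A output from this log:
result = check_tags(
    "Genesis block date: January 3, 2009. Creator: Satoshi Nakamoto. [generated]",
    tools_called=False,
)
```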
## Test B: Mixed retrieval/generation

Prompt: The user provides a quote from the Bitcoin wiki giving the genesis block text, then asks about both the text and the block reward.
### Result

Output:

- Text: "The Times 03/Jan/2009 Chancellor on brink of second bailout for banks." [retrieved from user's message]
- Reward: 50 BTC [generated]
- Tags present: YES
- Tags correct: YES — correctly distinguished user-provided info ([retrieved]) from training knowledge ([generated])
- Source named: YES — "from user's message"
- Facts accurate: YES
### Verdict: PASS

The model correctly distinguished sources within a single response, and it named the specific source for the [retrieved] claim.
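
Test B additionally requires that every [retrieved] tag name its source. That check can also be mechanized; a sketch (my own helper, assuming sources are written inside the bracket as in the output above, e.g. `[retrieved from user's message]`):

```python
import re

# Captures whatever follows "retrieved" inside the bracket.
RETRIEVED_RE = re.compile(r"\[retrieved([^\]]*)\]")

def retrieved_sources_named(output: str) -> bool:
    """True if every [retrieved ...] tag carries a non-empty source note,
    e.g. "[retrieved from user's message]". A bare [retrieved] fails."""
    notes = RETRIEVED_RE.findall(output)
    return bool(notes) and all(note.strip() for note in notes)

# Test B output from this log:
test_b_output = (
    '- Text: "The Times 03/Jan/2009 Chancellor on brink of second bailout for banks." '
    "[retrieved from user's message]\n"
    "- Reward: 50 BTC [generated]"
)
```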
## Summary Across Three Tests

| Test | Rule Framing | Tags Present? | Tags Correct? |
|------|-------------|---------------|---------------|
| 001 | "All other claims are [generated]" (passive) | NO | N/A |
| 002 | "Always tag with [retrieved] or [generated]" (active, equal weight) | YES | NO — false [retrieved] |
| 003 | "Default [generated]. Only upgrade to [retrieved] with named source" (default-generated) | YES | YES |
## Key Insight

The burden-of-proof framing matters. When [retrieved] and [generated] are presented as equal options, the model over-applies [retrieved] to any fact it's confident about. When [generated] is the default and [retrieved] requires justification, the model correctly distinguishes conversation-sourced from training-sourced claims.
## Deployed Rule (current in config.yaml)

"Tag every factual claim inline. Default is [generated] — you are pattern-matching from training data. Only use [retrieved] when you can name the specific tool call or document from THIS conversation that provided the fact. If no tool was called, every claim is [generated]. No exceptions."
## Status: FIRST MACHINERY DEPLOYED

This is Approach A (prompt-level) from the source-distinction spec. It is the cheapest and least reliable approach: it works on qwen3:30b with the correct framing, has not been tested on other models, and relies entirely on instruction-following.
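
At prompt level, the rule is just text prepended to the conversation. A sketch of how it might be injected as a system message via Ollama's `/api/chat` endpoint (the builder function is hypothetical; the payload shape follows the Ollama chat API):

```python
# Hypothetical helper: builds the JSON body for POST http://localhost:11434/api/chat.
TAGGING_RULE = (
    "Tag every factual claim inline. Default is [generated] ..."  # abbreviated; full rule text above
)

def build_chat_request(model: str, user_prompt: str, rule: str) -> dict:
    """Prepend the tagging rule as a system message ahead of the user turn."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": rule},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

payload = build_chat_request(
    "qwen3:30b",
    "What is Bitcoin's genesis block date, and who created Bitcoin?",
    TAGGING_RULE,
)
```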
## Known Limitations

1. Tag placement is inconsistent (end-of-response vs per-claim)
2. Not tested on smaller models
3. Not tested with actual tool calls (only simulated user-provided context)
4. A language model tagging its own outputs is not ground truth
5. Heavy thinking overhead (~500-2000 tokens of reasoning per response)
## Next Steps

1. Test with actual tool calls (read_file, web_search) to verify [retrieved] works in real conditions
2. Test on other models (smaller Ollama models, Claude, etc.)
3. Address per-claim vs end-of-response tag placement
4. Consider Approach B (two-pass) for more reliable tagging
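
Approach B (item 4) could be structured as: pass 1 answers without tags, pass 2 re-reads the answer plus the conversation's tool log and appends tags. A sketch with an injectable `call_model` function so the control flow can be shown without a live model (all names here are hypothetical, not from the spec):

```python
from typing import Callable

def two_pass_tag(call_model: Callable[[str], str], question: str, tool_log: str) -> str:
    """Pass 1: answer normally. Pass 2: tag each claim given the tool log,
    defaulting to [generated] exactly as in the deployed single-pass rule."""
    answer = call_model(question)
    tag_prompt = (
        "Tag every factual claim in the answer below. Default is [generated]. "
        "Use [retrieved] only for facts traceable to this tool log.\n"
        f"Tool log:\n{tool_log}\n\nAnswer:\n{answer}"
    )
    return call_model(tag_prompt)

# Stub model call for illustration only: echoes a canned tagged answer.
def stub_model(prompt: str) -> str:
    if prompt.startswith("Tag every factual claim"):
        return "Satoshi Nakamoto created Bitcoin. [generated]"
    return "Satoshi Nakamoto created Bitcoin."

tagged = two_pass_tag(stub_model, "Who created Bitcoin?", tool_log="(none)")
```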