Implement content moderation pipeline (Llama Guard + game-context prompts) #987

Closed
opened 2026-03-22 19:12:48 +00:00 by perplexity · 1 comment
Collaborator

Source

  • #946 — Integration Architecture: Eight Deep Dives (content moderation section)
  • #982 — Session Crystallization Playbook

Objective

Build a real-time content moderation pipeline for Timmy's narration output to prevent a Neuro-sama-style incident while handling Morrowind's mature themes appropriately.

Architecture

Moderation and TTS preprocessing run in parallel: while Llama Guard checks the sentence, Kokoro tokenizes and phonemizes. On pass, TTS fires immediately. On fail, a contextual fallback narration replaces flagged content.
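
The parallel flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `moderate`, `preprocess_for_tts`, and `narrate` are hypothetical names, and both coroutines are stand-ins that simulate the Llama Guard check and the Kokoro tokenize/phonemize step with short sleeps.

```python
import asyncio


async def moderate(sentence: str) -> bool:
    """Hypothetical stand-in for the Llama Guard check; True means the sentence passed."""
    await asyncio.sleep(0.01)  # simulated model latency
    return "forbidden" not in sentence.lower()


async def preprocess_for_tts(sentence: str) -> list[str]:
    """Hypothetical stand-in for Kokoro tokenization and phonemization."""
    await asyncio.sleep(0.01)  # simulated preprocessing latency
    return sentence.split()


async def narrate(sentence: str, fallback: str) -> list[str]:
    # Run both steps concurrently so moderation adds no extra latency on the pass path.
    passed, tokens = await asyncio.gather(
        moderate(sentence), preprocess_for_tts(sentence)
    )
    if passed:
        return tokens  # TTS can fire immediately with the prepared tokens
    # On fail, discard the prepared tokens and preprocess the fallback narration instead.
    return await preprocess_for_tts(fallback)
```

Under these assumptions the fail path pays for a second preprocessing pass; pre-tokenizing the fallback narrations ahead of time would avoid that cost.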

Scope

Model Options (ranked)

  1. Llama Guard 3 1B (INT4): Best for speed. <30ms per sentence on GPU. Run alongside TTS.
  2. ShieldGemma 2B: Best for accuracy. +10.8% AU-PRC over Llama Guard. Threshold tuning via probability scores.
  3. NeMo Guardrails: Framework for parallel guardrails with custom Colang rules (~0.5s with 5 GPU rails).
  4. LEG (Lightweight Explainable Guardrail): Emerging, <8ms, 7x faster.
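
Whichever model is chosen, its raw output has to be turned into a pass/fail decision. As a rough sketch for the Llama Guard option: the model is documented to reply with `safe`, or with `unsafe` followed by a line of comma-separated category codes such as `S1`. The `Verdict` type and `parse_verdict` helper below are hypothetical names, and treating unrecognized output as unsafe is an assumption, not something this issue specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    safe: bool
    categories: tuple[str, ...] = ()


def parse_verdict(raw: str) -> Verdict:
    """Parse a Llama Guard style reply into a Verdict (hypothetical helper)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        # Assumption: an empty reply is treated as unsafe.
        return Verdict(False, ("unparseable",))
    head = lines[0].lower()
    if head == "safe":
        return Verdict(True)
    if head == "unsafe":
        # The second line, when present, lists violated category codes.
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(False, tuple(c.strip() for c in codes if c.strip()))
    # Assumption: anything else is also treated as unsafe.
    return Verdict(False, ("unparseable",))
```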

Game-Context Awareness

  • System prompt instructs narrator to describe slavery as "game mechanic and historical worldbuilding"
  • Drugs described as "in-game consumable items"
  • Per-game moderation threshold profiles whitelist expected vocabulary: "Skooma", "slave", "Morag Tong"
  • Never editorialize on real-world parallels
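
One possible shape for a per-game profile is sketched below. Everything here is illustrative: `GameProfile`, `expected_terms_in`, `context_preamble`, the preamble wording, and the `unsafe_threshold` value are assumptions, and the threshold would only matter for a model that exposes probability scores.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GameProfile:
    game: str
    expected_vocabulary: tuple[str, ...]
    unsafe_threshold: float = 0.5  # hypothetical default, for score-based models only


# Vocabulary taken from the list above; the threshold value is made up for illustration.
MORROWIND = GameProfile(
    game="Morrowind",
    expected_vocabulary=("Skooma", "slave", "Morag Tong"),
    unsafe_threshold=0.7,
)


def expected_terms_in(text: str, profile: GameProfile) -> list[str]:
    """Return the profile's expected terms that occur in text (whole words, any case)."""
    return [
        term
        for term in profile.expected_vocabulary
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
    ]


def context_preamble(profile: GameProfile) -> str:
    """Build a game-context note to prepend to the moderation prompt."""
    vocab = ", ".join(profile.expected_vocabulary)
    return (
        f"The text below is fictional narration for the game {profile.game}. "
        f"Expected in-game vocabulary: {vocab}."
    )
```

Note that whole-word matching means a plural such as "slaves" is not matched by "slave"; a real profile would need to decide how to handle inflected forms.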

Fallback Strategy

  • On moderation failure: pre-generated contextual fallback narration (per game scene type)
  • Visible "filtered" indicator (transparency-as-entertainment, per Neuro-sama resolution)
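
A minimal sketch of the fallback lookup, assuming fallbacks are keyed by scene type. The scene names, the fallback lines, and the `Narration` and `resolve_narration` names are all hypothetical; the `filtered` flag is what a UI could use to show the visible indicator.

```python
from dataclasses import dataclass

# Hypothetical pre-generated fallbacks, keyed by scene type.
FALLBACKS = {
    "combat": "The clash ends as quickly as it began.",
    "dialogue": "The conversation drifts to other matters.",
}
DEFAULT_FALLBACK = "The journey continues."


@dataclass(frozen=True)
class Narration:
    text: str
    filtered: bool  # True when the original line was replaced


def resolve_narration(original: str, passed: bool, scene_type: str) -> Narration:
    """Return the original line on pass, or a scene-appropriate fallback on fail."""
    if passed:
        return Narration(original, filtered=False)
    # Unknown scene types fall back to a generic line rather than raising.
    return Narration(FALLBACKS.get(scene_type, DEFAULT_FALLBACK), filtered=True)
```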

Key Design Notes

  • Neuro-sama earned a 2-week Twitch ban for unfiltered AI output — moderation is non-negotiable
  • Latency budget: <30ms for moderation, parallel with TTS preprocessing
  • Morrowind's themes require nuanced filtering, not blanket censorship

Related

  • #966 — Three-tier metabolic LLM router (narration output feeds into this)
  • #959 — Narration templates (content that needs moderation)
claude was assigned by Rockachopa 2026-03-22 23:30:25 +00:00
Collaborator

PR created: http://143.198.27.163:3000/rockachopa/Timmy-time-dashboard/pulls/1055

Implemented content moderation pipeline with:

  • ContentModerator singleton using Llama Guard 3 via Ollama
  • Morrowind game-context profile with whitelisted vocabulary (Skooma, slave, Morag Tong, etc.)
  • Parallel-ready async architecture (asyncio.to_thread())
  • Configurable fail-open/fail-closed degradation
  • Scene-type fallback narrations when content is filtered
  • 40 unit tests, all passing
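
The PR's code is not shown in this thread, so the following is only a sketch of how `asyncio.to_thread()` and a fail-open/fail-closed switch might fit together. `blocking_check` and `check_with_degradation` are hypothetical names, and the 30 ms default timeout is borrowed from the latency budget in the issue body.

```python
import asyncio
import time


def blocking_check(text: str) -> bool:
    """Hypothetical blocking moderation call, e.g. a request to a local model server."""
    time.sleep(0.005)  # simulated latency
    return True


async def check_with_degradation(
    text: str,
    check=blocking_check,
    *,
    timeout_s: float = 0.03,
    fail_open: bool = False,
) -> bool:
    """Run a blocking check off the event loop; on timeout or error, degrade as configured."""
    try:
        # to_thread keeps the event loop free while the blocking call runs.
        return await asyncio.wait_for(asyncio.to_thread(check, text), timeout=timeout_s)
    except Exception:
        # Covers timeouts and moderator errors alike.
        # fail_open=True lets content through; fail_open=False blocks it.
        return fail_open
```

One caveat with this approach: a timeout stops the caller from waiting, but the worker thread itself cannot be cancelled and keeps running until the blocking call returns.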
Reference: Rockachopa/Timmy-time-dashboard#987