Add content moderation pipeline (Llama Guard + game-context prompts) #1056

Closed
opened 2026-03-23 01:38:25 +00:00 by claude · 1 comment
Collaborator

Origin

Extracted from the research PDF attached to #946 ("Timmy Time Integration Architecture: Eight Deep Dives into Real Deployment"), Section 6: Content Moderation.

Problem

An AI narrator generating live speech over game footage needs a content moderation pipeline to prevent harmful output (ref: Neuro-sama incident — 2-week Twitch ban for unfiltered LLM responses).

Proposed Architecture

Three-layer defense:

Layer 1: Game-Context System Prompt

  • Instruct the narrator to treat Morrowind's mature themes (slavery, Skooma/drugs, assassin guilds) as "game mechanics and historical worldbuilding"
  • Never editorialize on real-world parallels
  • Per-game vocabulary whitelists ("Skooma", "slave", "Morag Tong") to prevent over-filtering
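A minimal sketch of how Layer 1 could be assembled, assuming a hypothetical `GAME_PROFILES` dict and `build_system_prompt` helper (names not from the PR; the Morrowind context text and whitelist terms are from this issue):

```python
# Hypothetical per-game profile table; structure and names are illustrative.
GAME_PROFILES = {
    "morrowind": {
        "context": (
            "Treat slavery, Skooma, and assassin guilds as game mechanics "
            "and historical worldbuilding. Never editorialize on "
            "real-world parallels."
        ),
        # Whitelisted in-game vocabulary that must not be over-filtered.
        "whitelist": ["Skooma", "slave", "Morag Tong"],
    },
}

def build_system_prompt(game: str) -> str:
    """Compose the narrator system prompt for one game profile."""
    profile = GAME_PROFILES[game]
    terms = ", ".join(profile["whitelist"])
    return (
        f"You are a live game narrator. {profile['context']} "
        f"These in-game terms are permitted and must not be flagged: {terms}."
    )
```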

Layer 2: Real-Time Output Filter (Llama Guard)

  • Llama Guard 3 1B (INT4 quantized): <30ms per sentence on a consumer GPU
  • Run moderation and TTS preprocessing in parallel
  • On pass → TTS synthesis fires immediately
  • On fail → contextual fallback narration (pre-generated per game scene type)
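The parallel moderation/TTS-preprocessing step can be sketched with `asyncio.gather`. The `moderate` and `preprocess_tts` functions below are trivial stand-ins (a real `moderate` would call Llama Guard 3 1B), and the fallback table is illustrative:

```python
import asyncio

# Pre-generated fallback narration per scene type (illustrative entries).
FALLBACK = {
    "combat": "The battle rages on.",
    "dialogue": "Timmy listens carefully.",
}

async def moderate(sentence: str) -> bool:
    # Stand-in for a Llama Guard 3 1B call; here a trivial keyword check.
    return "forbidden" not in sentence.lower()

async def preprocess_tts(sentence: str) -> str:
    # Stand-in for TTS text normalization, run concurrently with moderation.
    return sentence.strip()

async def narrate(sentence: str, scene: str) -> str:
    # Run both stages in parallel so a passing sentence goes to TTS
    # with no added moderation latency.
    safe, processed = await asyncio.gather(
        moderate(sentence), preprocess_tts(sentence)
    )
    return processed if safe else FALLBACK[scene]
```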

Layer 3: Per-Game Moderation Profiles

  • Threshold tuning via probability scores (ShieldGemma 2B as alternative for higher accuracy)
  • Distinguish "narrating in-game slavery" from "promoting real slavery"
  • Configurable per-game profiles
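One way to express a per-game profile with threshold tuning (a sketch; the dataclass, field names, and the 0.85 threshold are assumptions, not the shipped config):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModerationProfile:
    game: str
    threshold: float       # block when unsafe probability >= threshold
    whitelist: frozenset   # lowercase in-game terms exempt from blocking

    def is_blocked(self, unsafe_prob: float, flagged_terms: set) -> bool:
        # Whitelisted in-game terms alone never trigger a block:
        # "narrating in-game slavery" is not "promoting real slavery".
        if flagged_terms and flagged_terms <= self.whitelist:
            return False
        return unsafe_prob >= self.threshold

MORROWIND = ModerationProfile(
    "morrowind", 0.85, frozenset({"slave", "skooma", "morag tong"})
)
```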

Model Options (from research)

| Model | Strength | Latency |
|-------|----------|---------|
| Llama Guard 3 1B (INT4) | Speed | <30ms/sentence |
| ShieldGemma 2B | Accuracy (+10.8% AU-PRC) | ~50-100ms |
| LEG (Lightweight Explainable Guardrail) | Ultra-fast | <8ms |
| NeMo Guardrails | Orchestration framework | ~500ms with 5 rails |
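A hypothetical helper for choosing among these models given a per-sentence latency budget, using the rough figures from the table and assuming capability roughly tracks latency (which holds for the three classifiers, less so for NeMo's orchestration role):

```python
# (model name, approximate per-sentence latency in ms), fastest first.
GUARD_MODELS = [
    ("LEG (Lightweight Explainable Guardrail)", 8),
    ("Llama Guard 3 1B (INT4)", 30),
    ("ShieldGemma 2B", 100),
    ("NeMo Guardrails", 500),
]

def pick_guard(budget_ms: int) -> str:
    """Return the slowest (assumed most capable) model within the budget."""
    fits = [name for name, ms in GUARD_MODELS if ms <= budget_ms]
    if not fits:
        raise ValueError("no guard model fits the latency budget")
    return fits[-1]
```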

Streaming Disclosure

  • YouTube: use "Altered or synthetic content" toggle
  • Twitch: standard community guidelines apply (no specific AI disclosure requirement as of March 2026)

References

  • Issue #946 (source research PDF)
  • Neuro-sama incident (December 2022) — cautionary tale for unfiltered AI output
  • #963 (Perception Bottleneck — related pipeline)
  • #966 (Three-tier router — related model selection)
claude self-assigned this 2026-03-23 01:38:50 +00:00
Author
Collaborator

PR #1059 created.

Implemented the three-layer content moderation pipeline:

  1. Game-context system prompts — per-game vocabulary whitelists (Morrowind, Skyrim) prevent over-filtering of in-game terms like Skooma, slave, Morag Tong
  2. Real-time output filter — Llama Guard 3 1B via Ollama with regex fallback for graceful degradation
  3. Per-game threshold tuning — configurable confidence thresholds per profile, contextual fallback narration by scene type

New module: src/infrastructure/guards/ with singleton pattern. Config via config/moderation.yaml. 32 unit tests, all passing.
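The "regex fallback for graceful degradation" in item 2 could look like the sketch below. This is not the code from PR #1059; the function name and placeholder pattern are illustrative, the point being that a failed Ollama call degrades to a coarse blocklist rather than passing text through unmoderated:

```python
import re

# Placeholder blocklist; the real patterns would live in config/moderation.yaml.
FALLBACK_PATTERNS = [re.compile(r"\bblocked_example\b", re.IGNORECASE)]

def is_safe(text: str, guard_call) -> bool:
    """Return True if the sentence may proceed to TTS."""
    try:
        # Primary path: Llama Guard 3 1B via Ollama (guard_call wraps the
        # HTTP request in the real module).
        return guard_call(text)
    except Exception:
        # Ollama unreachable or timed out: fall back to the regex blocklist.
        return not any(p.search(text) for p in FALLBACK_PATTERNS)
```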


Reference: Rockachopa/Timmy-time-dashboard#1056