Add content moderation pipeline (Llama Guard + game-context prompts) #1056

Closed
opened 2026-03-23 01:38:25 +00:00 by claude · 1 comment
Collaborator

Origin

Extracted from the research PDF attached to #946 ("Timmy Time Integration Architecture: Eight Deep Dives into Real Deployment"), Section 6: Content Moderation.

Problem

An AI narrator generating live speech over game footage needs a content moderation pipeline to prevent harmful output (ref: Neuro-sama incident — 2-week Twitch ban for unfiltered LLM responses).

Proposed Architecture

Three-layer defense:

Layer 1: Game-Context System Prompt

  • Instruct the narrator to treat Morrowind's mature themes (slavery, Skooma/drugs, assassin guilds) as "game mechanics and historical worldbuilding"
  • Never editorialize on real-world parallels
  • Per-game vocabulary whitelists ("Skooma", "slave", "Morag Tong") to prevent over-filtering
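A minimal sketch of how Layer 1 could be assembled, assuming a hypothetical `GAME_PROFILES` dict and `build_system_prompt` helper (names not from the PR; the Morrowind context text and whitelist terms are from this issue):

```python
# Hypothetical per-game profile table; structure and names are illustrative.
GAME_PROFILES = {
    "morrowind": {
        "context": (
            "Treat slavery, Skooma, and assassin guilds as game mechanics "
            "and historical worldbuilding. Never editorialize on "
            "real-world parallels."
        ),
        # Whitelisted in-game vocabulary that must not be over-filtered.
        "whitelist": ["Skooma", "slave", "Morag Tong"],
    },
}

def build_system_prompt(game: str) -> str:
    """Compose the narrator system prompt for one game profile."""
    profile = GAME_PROFILES[game]
    terms = ", ".join(profile["whitelist"])
    return (
        f"You are a live game narrator. {profile['context']} "
        f"These in-game terms are permitted and must not be flagged: {terms}."
    )
```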

Layer 2: Real-Time Output Filter (Llama Guard)

  • Llama Guard 3 1B (INT4 quantized): <30ms per sentence on a consumer GPU
  • Run moderation and TTS preprocessing in parallel
  • On pass → TTS synthesis fires immediately
  • On fail → contextual fallback narration (pre-generated per game scene type)
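The parallel moderation/TTS-preprocessing step can be sketched with `asyncio.gather`. The `moderate` and `preprocess_tts` functions below are trivial stand-ins (a real `moderate` would call Llama Guard 3 1B), and the fallback table is illustrative:

```python
import asyncio

# Pre-generated fallback narration per scene type (illustrative entries).
FALLBACK = {
    "combat": "The battle rages on.",
    "dialogue": "Timmy listens carefully.",
}

async def moderate(sentence: str) -> bool:
    # Stand-in for a Llama Guard 3 1B call; here a trivial keyword check.
    return "forbidden" not in sentence.lower()

async def preprocess_tts(sentence: str) -> str:
    # Stand-in for TTS text normalization, run concurrently with moderation.
    return sentence.strip()

async def narrate(sentence: str, scene: str) -> str:
    # Run both stages in parallel so a passing sentence goes to TTS
    # with no added moderation latency.
    safe, processed = await asyncio.gather(
        moderate(sentence), preprocess_tts(sentence)
    )
    return processed if safe else FALLBACK[scene]
```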

Layer 3: Per-Game Moderation Profiles

  • Threshold tuning via probability scores (ShieldGemma 2B as alternative for higher accuracy)
  • Distinguish "narrating in-game slavery" from "promoting real slavery"
  • Configurable per-game profiles
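One way to express a per-game profile with threshold tuning (a sketch; the dataclass, field names, and the 0.85 threshold are assumptions, not the shipped config):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModerationProfile:
    game: str
    threshold: float       # block when unsafe probability >= threshold
    whitelist: frozenset   # lowercase in-game terms exempt from blocking

    def is_blocked(self, unsafe_prob: float, flagged_terms: set) -> bool:
        # Whitelisted in-game terms alone never trigger a block:
        # "narrating in-game slavery" is not "promoting real slavery".
        if flagged_terms and flagged_terms <= self.whitelist:
            return False
        return unsafe_prob >= self.threshold

MORROWIND = ModerationProfile(
    "morrowind", 0.85, frozenset({"slave", "skooma", "morag tong"})
)
```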

Model Options (from research)

| Model | Strength | Latency |
|-------|----------|---------|
| Llama Guard 3 1B (INT4) | Speed | <30ms/sentence |
| ShieldGemma 2B | Accuracy (+10.8% AU-PRC) | ~50-100ms |
| LEG (Lightweight Explainable Guardrail) | Ultra-fast | <8ms |
| NeMo Guardrails | Orchestration framework | ~500ms with 5 rails |
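A hypothetical helper for choosing among these models given a per-sentence latency budget, using the rough figures from the table and assuming capability roughly tracks latency (which holds for the three classifiers, less so for NeMo's orchestration role):

```python
# (model name, approximate per-sentence latency in ms), fastest first.
GUARD_MODELS = [
    ("LEG (Lightweight Explainable Guardrail)", 8),
    ("Llama Guard 3 1B (INT4)", 30),
    ("ShieldGemma 2B", 100),
    ("NeMo Guardrails", 500),
]

def pick_guard(budget_ms: int) -> str:
    """Return the slowest (assumed most capable) model within the budget."""
    fits = [name for name, ms in GUARD_MODELS if ms <= budget_ms]
    if not fits:
        raise ValueError("no guard model fits the latency budget")
    return fits[-1]
```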

Streaming Disclosure

  • YouTube: use "Altered or synthetic content" toggle
  • Twitch: standard community guidelines apply (no specific AI disclosure requirement as of March 2026)

References

  • Issue #946 (source research PDF)
  • Neuro-sama incident (December 2022) — cautionary tale for unfiltered AI output
  • #963 (Perception Bottleneck — related pipeline)
  • #966 (Three-tier router — related model selection)
claude self-assigned this 2026-03-23 01:38:50 +00:00
Author
Collaborator

PR #1059 created.

Implemented the three-layer content moderation pipeline:

  1. Game-context system prompts — per-game vocabulary whitelists (Morrowind, Skyrim) prevent over-filtering of in-game terms like Skooma, slave, Morag Tong
  2. Real-time output filter — Llama Guard 3 1B via Ollama with regex fallback for graceful degradation
  3. Per-game threshold tuning — configurable confidence thresholds per profile, contextual fallback narration by scene type

New module: src/infrastructure/guards/ with singleton pattern. Config via config/moderation.yaml. 32 unit tests, all passing.
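The "regex fallback for graceful degradation" in item 2 could look like the sketch below. This is not the code from PR #1059; the function name and placeholder pattern are illustrative, the point being that a failed Ollama call degrades to a coarse blocklist rather than passing text through unmoderated:

```python
import re

# Placeholder blocklist; the real patterns would live in config/moderation.yaml.
FALLBACK_PATTERNS = [re.compile(r"\bblocked_example\b", re.IGNORECASE)]

def is_safe(text: str, guard_call) -> bool:
    """Return True if the sentence may proceed to TTS."""
    try:
        # Primary path: Llama Guard 3 1B via Ollama (guard_call wraps the
        # HTTP request in the real module).
        return guard_call(text)
    except Exception:
        # Ollama unreachable or timed out: fall back to the regex blocklist.
        return not any(p.search(text) for p in FALLBACK_PATTERNS)
```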


Reference: Rockachopa/Timmy-time-dashboard#1056