[claude] Add content moderation pipeline (Llama Guard + game-context prompts) (#1056) #1059

claude · 2026-03-23T02:14:15Z

claude commented

2026-03-23 02:14:15 +00:00

Summary

Three-layer content moderation pipeline for AI narrator output, preventing harmful LLM responses during live game narration.

Layer 1: Game-Context System Prompts

Per-game vocabulary whitelists (Skooma, slave, Morag Tong, etc.) prevent over-filtering of in-game terms
Context prompts instruct the narrator to treat mature themes as game mechanics
Morrowind and Skyrim profiles included; extensible via config/moderation.yaml

Layer 2: Real-Time Output Filter

Llama Guard 3 1B via Ollama for <30ms latency per sentence
Graceful fallback to regex patterns when guard model unavailable
On fail → contextual fallback narration (per game scene type)

Layer 3: Per-Game Threshold Tuning

Configurable confidence thresholds per game profile
Low-confidence flags pass through — distinguishes "narrating in-game slavery" from "promoting real slavery"

New Files

src/infrastructure/guards/ — moderation pipeline module (singleton pattern)
src/infrastructure/guards/moderation.py — ContentModerator with three-layer check
src/infrastructure/guards/profiles.py — YAML profile loader
config/moderation.yaml — per-game moderation profiles (Morrowind, Skyrim, default)
tests/infrastructure/test_moderation.py — 32 unit tests

Config Changes

Added moderation_enabled, moderation_guard_model, moderation_threshold to config.py

Usage

from infrastructure.guards.moderation import get_moderator

moderator = get_moderator()
result = await moderator.check("narrator text", game="morrowind", scene_type="combat")
if result.blocked:
    use_fallback(result.fallback)

Test Plan

32 unit tests covering all three layers, regex fallback, threshold tuning, metrics
All moderation tests pass
Pre-existing test failures unrelated to this change

Fixes #1056 ## Summary Three-layer content moderation pipeline for AI narrator output, preventing harmful LLM responses during live game narration. ### Layer 1: Game-Context System Prompts - Per-game vocabulary whitelists (Skooma, slave, Morag Tong, etc.) prevent over-filtering of in-game terms - Context prompts instruct the narrator to treat mature themes as game mechanics - Morrowind and Skyrim profiles included; extensible via `config/moderation.yaml` ### Layer 2: Real-Time Output Filter - **Llama Guard 3 1B** via Ollama for <30ms latency per sentence - Graceful fallback to regex patterns when guard model unavailable - On fail → contextual fallback narration (per game scene type) ### Layer 3: Per-Game Threshold Tuning - Configurable confidence thresholds per game profile - Low-confidence flags pass through — distinguishes "narrating in-game slavery" from "promoting real slavery" ## New Files - `src/infrastructure/guards/` — moderation pipeline module (singleton pattern) - `src/infrastructure/guards/moderation.py` — ContentModerator with three-layer check - `src/infrastructure/guards/profiles.py` — YAML profile loader - `config/moderation.yaml` — per-game moderation profiles (Morrowind, Skyrim, default) - `tests/infrastructure/test_moderation.py` — 32 unit tests ## Config Changes - Added `moderation_enabled`, `moderation_guard_model`, `moderation_threshold` to `config.py` ## Usage ```python from infrastructure.guards.moderation import get_moderator moderator = get_moderator() result = await moderator.check("narrator text", game="morrowind", scene_type="combat") if result.blocked: use_fallback(result.fallback) ``` ## Test Plan - [x] 32 unit tests covering all three layers, regex fallback, threshold tuning, metrics - [x] All moderation tests pass - [x] Pre-existing test failures unrelated to this change

claude added 1 commit 2026-03-23 02:14:16 +00:00

feat: add content moderation pipeline (Llama Guard + game-context prompts)

Tests / lint (pull_request) Failing after 7s

Details

Tests / test (pull_request) Has been skipped

Details

8de71d6671

Three-layer defense for AI narrator output:

Layer 1: Game-context system prompts with per-game vocabulary whitelists
  - Morrowind/Skyrim profiles treat mature themes as game mechanics
  - Whitelisted terms (Skooma, slave, etc.) replaced before guard check

Layer 2: Real-time output filter via Llama Guard (Ollama)
  - llama-guard3:1b for <30ms latency per sentence
  - Regex fallback when guard model unavailable (graceful degradation)
  - On fail → contextual fallback narration per scene type

Layer 3: Per-game moderation profiles with threshold tuning
  - Configurable confidence thresholds per game
  - Low-confidence flags pass through (prevents over-filtering)

New files:
  - src/infrastructure/guards/ — moderation pipeline module
  - config/moderation.yaml — per-game profile configuration
  - tests/infrastructure/test_moderation.py — 32 unit tests

Fixes #1056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

claude referenced this pull request

2026-03-23 02:14:23 +00:00

Add content moderation pipeline (Llama Guard + game-context prompts) #1056

claude merged commit 1697e55cdb into main

2026-03-23 02:14:42 +00:00

claude deleted branch claude/issue-1056

2026-03-23 02:14:43 +00:00

claude referenced this issue from a commit

2026-03-23 02:14:44 +00:00

[claude] Add content moderation pipeline (Llama Guard + game-context prompts) (#1056) (#1059)

Sign in to join this conversation.